I never understand such remarcs > Given the power requirements per card, a back ...

chongli · on Nov 10, 2019

So what? Training model is the hardest part, then you just reuse results

I doubt anyone is going to want to run a 33GB model on their phone.

So what? I can't run weather simulation on my laptop.

You only need to run the weather simulation once and then broadcast your forecast to everyone’s devices. You can’t do that with NLP. In order to be useful, NLP models need to run on different input data for every user. With a giant 33GB model, that means round-tripping to the data centre.

If you have to run everything in the cloud, your applications are limited. The cost is also very high, given that there are way more user devices than servers in the world. That means you need to build more data centres if you plan to run these giant models for every application you want to offer your users.

phoboslab · on Nov 10, 2019

> I doubt anyone is going to want to run a 33GB model on their phone.

Why not? Many modern phones have upwards of 512GB of storage. 33 GB for a useful model seems entirely reasonable to me.

chongli · on Nov 10, 2019

That’s for one application. Phones have dozens of apps. If they all use different, giant models like this, then 512GB won’t be nearly enough.

Moreover, what is the performance going to be like? It can’t be too spectacular if your model doesn’t fit in RAM. 33GB is manageable on a beefy server with a ton of RAM. You’re not going to have the same luxury on your phone.

The other major aspect of it is memory bandwidth. If the model was designed to run on a high end GPU, with all 33GB stored in graphics memory, then it’s going to perform terribly if it has to be paged in and out of flash on a phone.

sgt101 · on Nov 10, 2019

What are the applications of deep learning that look like weather simulations (as in one run -> results to 10m people?) In my experience deep learning systems are aimed at applications that are single use 1 run -> 1 person.

The training cost is more important than you think as well. To train a model normally requires 10's or 100's of experiments, meaning that we are consuming 30 -> 3k people's carbon, and the application of the model is typically narrow, so we end up doing 4 or 5 projects per year per group... meaning that we could spend 10's of k carbon per team to produce $10m's benefit. I wonder if we can justify this at all?