I am one of the authors of "Outrageously Large Neural Networks". Yes, overfitting is a problem, and we employed dropout to combat it. Even with dropout, we found diminishing returns from added capacity once the capacity of the network exceeds the number of examples in the training data (see sec. 5.2). To demonstrate significant gains from really large networks, we had to use huge datasets, up to 100 billion words.
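
For readers who want a feel for how the sparse gating works, here is a stripped-down sketch of a top-k gated mixture-of-experts layer in PyTorch. This is not the implementation used in the paper; the expert count, layer sizes, and k below are arbitrary illustration choices.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoE(nn.Module):
        """Simplified sparsely-gated mixture-of-experts layer (top-k gating).
        Sizes here are arbitrary illustration choices, not the paper's settings."""

        def __init__(self, d_model=512, d_hidden=1024, n_experts=16, k=2):
            super().__init__()
            self.k = k
            # Each expert is a small feed-forward network.
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                              nn.Linear(d_hidden, d_model))
                for _ in range(n_experts)
            )
            # Gating network produces one logit per expert.
            self.gate = nn.Linear(d_model, n_experts)

        def forward(self, x):              # x: [batch, d_model]
            logits = self.gate(x)          # [batch, n_experts]
            # Keep only the top-k experts per example; the rest get zero
            # weight, so they are never evaluated for that example.
            topk_val, topk_idx = logits.topk(self.k, dim=-1)
            weights = F.softmax(topk_val, dim=-1)          # [batch, k]
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = topk_idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot:slot+1] * expert(x[mask])
            return out

The gating in the paper is noisier than this: Gaussian noise is added to the gate logits before the top-k selection to help balance load across experts, which this sketch omits.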


Impressive work, mate!

Does mixture-of-experts work well the other way around, as a way to minimize power and hardware requirements for common-sized problems?

And would it work in low-precision networks, like BinaryConnect?

