
> Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

My guess is they used Chinchilla scaling laws, and the parameter count for GPT-4 is either barely larger or maybe even smaller than GPT-3. Look at what Meta was able to accomplish with LLaMA using far fewer parameters.
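
A rough back-of-the-envelope sketch of the Chinchilla rule of thumb (Hoffmann et al. 2022): train on roughly 20 tokens per parameter, with training compute C ≈ 6·N·D FLOPs. The compute figure below is a public GPT-3-scale estimate, not anything from the GPT-4 report:

  # Illustrative sketch of the Chinchilla compute-optimal rule of thumb.
  # Assumes C ~= 6 * N * D FLOPs and D ~= 20 * N (tokens per parameter).
  def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
      """Return (params, tokens) that roughly balance C = 6*N*D with D = 20*N."""
      params = (compute_flops / (6 * 20)) ** 0.5
      tokens = 20 * params
      return params, tokens

  # GPT-3-scale training compute, ~3.1e23 FLOPs (public estimate)
  n, d = chinchilla_optimal(3.1e23)
  print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
  # -> roughly 5e10 params and 1e12 tokens, i.e. far fewer parameters
  #    than GPT-3's 175B but trained on far more data.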



The larger context length makes me think they have a more memory-efficient attention mechanism.
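
Naive attention materializes an (L x L) score matrix per head, which gets expensive fast at long context lengths. A minimal sketch of the query-chunking idea behind memory-efficient / Flash-style attention (illustrative only, not the GPT-4 implementation):

  import numpy as np

  def chunked_attention(q, k, v, block_size=128):
      """Process queries in blocks so peak memory is O(block_size * L), not O(L^2)."""
      L, d = q.shape
      out = np.empty_like(q)
      for start in range(0, L, block_size):
          q_blk = q[start:start + block_size]           # (b, d)
          scores = q_blk @ k.T / np.sqrt(d)             # only a (b, L) slice in memory
          scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
          weights = np.exp(scores)
          weights /= weights.sum(axis=-1, keepdims=True)
          out[start:start + block_size] = weights @ v
      return out

  # A 32768-token context would need a 32768 x 32768 score matrix per head
  # (~4 GB in fp32) with naive attention; chunking keeps it to block_size x L.
  L, d = 4096, 64
  q, k, v = (np.random.randn(L, d).astype(np.float32) for _ in range(3))
  out = chunked_attention(q, k, v)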



