
So I can get LLM results from an SLM if I run it long enough?



They show Llama 3.2 1B with chain-of-thought outperforming Llama 3.1 8B, and 3.2 3B outperforming 3.1 70B. It's less clear whether inference time is actually faster for CoT 3B using 256 generations than for 70B if you have enough RAM. Basically a classic RAM/compute trade-off.
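To make the trade-off concrete, here is a minimal best-of-N sketch of spending generations instead of parameters. generate() and score() are placeholder stand-ins for a real small model and a real verifier/reward model, not the actual Hugging Face code:

    import random

    def generate(prompt: str) -> str:
        # Stand-in: a real call would sample one chain-of-thought
        # completion from the small model (e.g. a 3B model).
        return f"candidate answer {random.randint(0, 9)}"

    def score(prompt: str, candidate: str) -> float:
        # Stand-in: a real verifier/reward model would estimate how
        # likely the candidate is to be correct.
        return random.random()

    def best_of_n(prompt: str, n: int = 256) -> str:
        # Spend compute instead of parameters: sample n candidates
        # and keep the one the scorer rates highest.
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda c: score(prompt, c))

    print(best_of_n("Solve: 17 * 24 = ?"))

The RAM side of the trade-off is that only the small model's weights need to fit in memory; the compute side is the n forward passes.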


From a practical standpoint, scaling test-time compute does enable datacenter-scale performance on the edge. I can't feasibly run a 70B model on my iPhone, but I can run 3B, even if it takes a lot of time to produce a solution comparable to 70B's 0-shot answer.

I think it *is* an unlock.


I struggle with this idea of "run it long enough", or another description I have heard, "give the model time to think". It's not a thing - it takes as long as it takes. What I'm taking away from this is two things:

1. The reason for generalizations like "long enough" and "think more" is apparently that the methods are somewhat obscure.

2. Those methods are being explored by Hugging Face to make them less obscure.

Am I getting that right? I have been struggling to see past the metaphors and understand exactly what additional computation is being done. Here I read it's something like multiple guesses being fed back in and chosen among, which means it's just multiple inferences in series that are all related to solving one problem.
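For what it's worth, the simplest concrete version of "multiple guesses chosen among" is self-consistency voting. A minimal sketch, with sample_answer() as a placeholder for a real model call:

    import random
    from collections import Counter

    def sample_answer(prompt: str) -> str:
        # Stand-in: a real call would sample one chain-of-thought
        # completion and extract its final answer.
        return random.choice(["408", "408", "408", "418", "398"])

    def self_consistency(prompt: str, num_samples: int = 16) -> str:
        # The "extra computation" is just num_samples independent
        # inferences on the same problem, followed by a majority vote.
        answers = [sample_answer(prompt) for _ in range(num_samples)]
        return Counter(answers).most_common(1)[0][0]

    print(self_consistency("Solve: 17 * 24 = ?"))

So "thinking longer" just means running more of these related inferences before committing to an answer.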



