Great work! When I use models like o1, they work better than sonnet and 4o for tasks that require some thinking but the output is often very verbose. Is it possible to get the best of both worlds? The thinking takes place resulting in better performance but the output is straightforward to work with like with sonnet and 4o. Did you observe similar behaviour with the 1B and 3B models? How does the model behaviour change when used for normal tasks that don't require thinking?
Also, how well do these models work for extracting structured output? E.g., perform OCR on some handwritten text with math, convert it to HTML, format the formulas correctly, etc. Single-shot prompting doesn't work well for such problems, but splitting the steps into consecutive API calls works well.
That's a good point. We don't see that in our experiments because it's all in the math domain. However, for OpenAI it's plausible that training for o1 might conflict with standard instruction training, leading to a less human-preferred output style.
In this paper and HF's replication the model used to produce solutions to MATH problems is off-the-shelf. It is induced to produce step-by-step CoT-style solutions by few-shot ICL prompts or by instructions.
Yes, the search process (beam search or best-of-N) does produce verbose traces because there is branching involved when sampling "thoughts" from the base model. These branched traces (including incomplete "abandoned" branches) can be shown to the user or hidden, if the approach is deployed as-is.
OpenAI recommends using o1 to generate the verbose plan and then chaining the verbose output to a cheaper model (e.g. gpt-4o-mini) to convert it into structured data / function calls / summary etc. They call it the planner-executor pattern. [1]
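For what it's worth, here is a minimal sketch of how that chaining could look. The `planner` and `executor` arguments are hypothetical callables standing in for whatever client you use to call the two models, and the stubs below are fakes so the example runs offline. This is not OpenAI's API, just the shape of the pattern.

```python
import json
from typing import Callable, List


def plan_then_execute(
    task: str,
    planner: Callable[[str], str],
    executor: Callable[[str], str],
) -> List[str]:
    """Ask the 'thinking' model for a verbose plan, then have a cheaper
    model turn that plan into structured data (here: a JSON list)."""
    plan = planner(task)
    structured = executor(
        "Return the steps of this plan as a JSON list of strings:\n" + plan
    )
    return json.loads(structured)


# Hypothetical stand-ins for real model calls (assumptions for illustration).
def fake_planner(task: str) -> str:
    return (
        "First, read the image.\n"
        "Then, transcribe the formulas.\n"
        "Finally, emit HTML."
    )


def fake_executor(prompt: str) -> str:
    # Drop the instruction line and return the plan lines as JSON.
    plan_lines = prompt.split("\n")[1:]
    return json.dumps(plan_lines)


steps = plan_then_execute(
    "OCR a handwritten math page into HTML", fake_planner, fake_executor
)
print(steps)
```

In a real setup you would swap the two stubs for calls to the expensive and the cheap model respectively; the verbose reasoning never has to reach the end user.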
The big question is whether or not o3 is using any type of “meta-generation” algorithm at inference time, i.e., are there multiple invocations of the LLM generation at all, or does it generate an insanely long reasoning trace in a single autoregressive stream that somehow implicitly has search-like behavior? In other words, is the search-like behavior learned entirely in post-training and only implicitly exhibited at inference time, or is it explicitly done at inference time?
Given the enormous compute costs of o3, my speculation has been that search is explicit, but I’ve seen this post from Nathan Lambert, for example, that speculates (in the context of o1) that it’s possible for search to be entirely “baked into” a single-stream roll-out (which would depend on significant long-context innovations):
In the blog post, learned verifiers are mentioned. Are these learned offline using data, and is the intent to learn a scoring heuristic to help the search?
The verifier is trained with soft values of reward-to-go for each solution prefix, obtained from Monte Carlo rollouts of step-by-step solutions sampled from the "base" model.
In other words: 1) sample step-by-step solutions from the "base" model; 2) do it at non-zero temperature so that you can get multiple continuations from each solution prefix; 3) use MATH labels to decide if a full solution (leaf/terminal node in an MC rollout) has reward `1` or `0`; 4) roll up these rewards to calculate reward-to-go for each intermediate step.
Yes, a verifier trained in this manner can be used to score solution prefixes (as a process verifier) or a full solution (as an outcome verifier).
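A toy sketch of step 4, in case it helps. It assumes the rollouts have already been sampled and graded, and it estimates the reward-to-go of a prefix as the fraction of rollouts through that prefix that ended with reward `1`. The function name `prefix_values` and the toy data are made up for illustration; the actual training pipeline may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Prefix = Tuple[str, ...]


def prefix_values(rollouts: List[Tuple[List[str], int]]) -> Dict[Prefix, float]:
    """rollouts: (steps, reward) pairs, where reward is 1 if the final
    answer matched the label and 0 otherwise.

    Returns the Monte Carlo estimate of reward-to-go for every prefix:
    the mean terminal reward over all rollouts passing through it."""
    totals: Dict[Prefix, float] = defaultdict(float)
    counts: Dict[Prefix, int] = defaultdict(int)
    for steps, reward in rollouts:
        for k in range(1, len(steps) + 1):
            prefix = tuple(steps[:k])
            totals[prefix] += reward
            counts[prefix] += 1
    return {p: totals[p] / counts[p] for p in totals}


# Toy tree: one shared first step, two second steps, four leaves.
rollouts = [
    (["a", "b1", "c1"], 1),
    (["a", "b1", "c2"], 0),
    (["a", "b2", "c3"], 0),
    (["a", "b2", "c4"], 0),
]
values = prefix_values(rollouts)
print(values[("a",)], values[("a", "b1")], values[("a", "b2")])
```

These soft values (rather than hard 0/1 labels) are what the verifier would then be trained to predict for each prefix.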
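To make the outcome-verifier use concrete, here is a small sketch of best-of-N selection over already-scored samples. Both the plain variant (take the single highest-scored solution) and a weighted variant (sum scores per distinct final answer) are shown. The sample data and function names are illustrative assumptions, not taken from the blog's code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, float]  # (final_answer, verifier_score)


def best_of_n(samples: List[Sample]) -> str:
    """Return the final answer of the single highest-scored solution."""
    return max(samples, key=lambda s: s[1])[0]


def weighted_best_of_n(samples: List[Sample]) -> str:
    """Sum verifier scores over solutions sharing a final answer and
    return the answer with the largest total."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=lambda a: totals[a])


# Four sampled solutions to one problem, already scored by the verifier.
samples = [("42", 0.9), ("41", 0.6), ("41", 0.5), ("40", 0.1)]
plain_choice = best_of_n(samples)
weighted_choice = weighted_best_of_n(samples)
print(plain_choice, weighted_choice)
```

The two variants can disagree, as they do on this toy data: one confident sample wins the plain version, while two moderately scored samples agreeing on an answer win the weighted one.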
Minor gripe - The best-of-n | beam search illustration is not compatible with red-green color blindness. I can literally not see the difference between the Rejected and the Selected dots even if I zoom in.
Happy to see the "inference time compute" term being used primarily nowadays - it's a much more precise and appropriate term compared to the unwieldy "test-time compute" that OpenAI used to call it when they thought they "invented" scaling inference time.
Using a different 1B model as the verifier makes sense, yes. Using a Llama 8B finetune as the verifier to compare the inference-time-scaled 1B against 8B makes little sense to me.
Using a 3B model with an 8B verifier against a 70B model would make sense too. That being said, their performance barely crossed the 70B line with 256 samples per problem. This is 256*(8+3)/70 ~ 40 times more computationally expensive than running the 70B model as is.
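Spelling out that back-of-the-envelope estimate: it assumes cost per generated token is proportional to parameter count, that every one of the 256 samples is both generated by the 3B model and scored by the 8B verifier, and that each sample is about as long as the single 70B answer. Those are simplifying assumptions, so treat the result as a rough order of magnitude.

```python
# Rough relative-cost estimate, in units of "billions of parameters
# touched per token" (a crude proxy for compute).
n_samples = 256
solver_params_b = 3
verifier_params_b = 8
baseline_params_b = 70

search_cost = n_samples * (solver_params_b + verifier_params_b)
baseline_cost = baseline_params_b
ratio = search_cost / baseline_cost

print(f"search is roughly {ratio:.1f}x the cost of one 70B pass")
```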
"1B solver + 8B verifier + search" beating 1B-0-shot or 1B-majority as baselines isn't illustrative imo. In other words, by using larger verifier, HF's replication fails to establish a "fair" baseline. Still an awesome blog and release/repository from HF's group - I love it!
"Solver" is `meta-llama/Llama-3.2-1B-Instruct` (1B model, and they use 3B for another experiment), and verifier is `RLHFlow/Llama3.1-8B-PRM-Deepseek-Data`.
For problems that require multi-step reasoning, standard LLMs seem to be stuck. The field is increasingly interested in models like o1 that output many "guesses" to find the right one. Currently open-source does not know how to do this, but we are reimplementing several possible directions to try. This replicates one important path using search and a verifier model.
To spend more compute at inference time, at least two simple approaches are readily available:
1) make the model output a full solution, step-by-step, then induce it to revise the solution - repeat this as many times as you have token budget for. You can do this via prompting alone (see Reflexion for example), or you can fine-tune the model to do that. The paper explores fine-tuning of the base model to turn it into a self-revision model.
2) sample step-by-step (one "thought"-sentence per line) solutions from the model, and do it at non-zero temperature to be able to sample multiple next steps. Then use a verifier model to choose between next-step candidates and prefer to continue the rollout of the more promising branches of "thoughts". There are many methods of exploring such a tree when you can score intermediate nodes (beam search is an almost 50-year-old algorithm!).
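Approach 2 can be sketched as a verifier-guided beam search. In the sketch below, `propose`, `score` and `is_done` are hypothetical callables: in a real system `propose` would sample next steps from the solver LLM and `score` would call the process verifier. Here they are replaced by a toy arithmetic problem (reach a total of 7 using steps of +1, +2 or +3) so the code runs on its own.

```python
from typing import Callable, List, Tuple

Prefix = Tuple[str, ...]


def verifier_beam_search(
    propose: Callable[[Prefix, int], List[str]],
    score: Callable[[Prefix], float],
    is_done: Callable[[Prefix], bool],
    beam_width: int = 2,
    expand: int = 3,
    max_depth: int = 4,
) -> Prefix:
    """Keep the `beam_width` best partial solutions at each depth,
    expanding each unfinished one with `expand` candidate next steps."""
    beams: List[Prefix] = [()]
    for _ in range(max_depth):
        candidates: List[Prefix] = []
        for prefix in beams:
            if is_done(prefix):
                candidates.append(prefix)  # carry finished solutions forward
                continue
            for step in propose(prefix, expand):
                candidates.append(prefix + (step,))
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
        if all(is_done(p) for p in beams):
            break
    return max(beams, key=score)


# Toy stand-ins for the solver and the verifier (assumptions).
TARGET = 7


def total(prefix: Prefix) -> int:
    return sum(int(step) for step in prefix)


def toy_propose(prefix: Prefix, k: int) -> List[str]:
    return ["+1", "+2", "+3"][:k]


def toy_score(prefix: Prefix) -> float:
    return -abs(TARGET - total(prefix))


def toy_is_done(prefix: Prefix) -> bool:
    return total(prefix) >= TARGET


best = verifier_beam_search(toy_propose, toy_score, toy_is_done)
print(best, total(best))
```

The abandoned candidates at each depth are exactly the extra inference-time compute being spent; `beam_width` and `expand` are the knobs that trade compute for quality.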
If I understand correctly, Hugging Face is exploring approaches to tuning the output quality of a given model by tuning how long to let it run.
Normally when you run an LLM, you set your prompt and whatever tunable parameters, and the LLM software (e.g. llama.cpp) spits out tokens at whatever rate it can. If you want higher quality, you run a bigger model (though you're limited by the amount of memory you have available). If you want higher speed, you run a smaller model. Hugging Face seems to be looking at ways to make this tradeoff without switching between different models.
They show Llama 3.2 1B with chain-of-thought outperforming Llama 3.1 8B, and 3.2 3B outperforming 3.1 70B. It’s less clear whether inference time is actually faster for CoT 3B using 256x generations vs 70B if you have enough RAM. Basically a classical RAM/compute trade-off.
From a practical standpoint, scaling test-time compute does enable datacenter-scale performance on the edge. I cannot feasibly run 70B on my iPhone, but I can run 3B even if it takes a lot of time for it to produce a solution comparable to 70B's 0-shot.
I struggle with this idea of "run it long enough", or another description I have heard, "give the model time to think". It's not a thing - it takes as long as it takes. What I'm taking away from this is two things:
1. the reason for generalizations like 'long enough' and 'think more' is apparently that the methods are somewhat obscure
2. those methods are being explored by Hugging Face to make them less obscure
Am I getting that right? I have been struggling to see past the metaphors and understand exactly what additional computation is being done - and here I read it's something like multiple guesses being fed back in and chosen among, which means it's just multiple inferences in series that are all related to solving one problem.
Happy to answer any questions about these methods.