Show HN: Steiner – An open-source reasoning model inspired by OpenAI o1

schmeichel · 2024-10-22T19:14:46 1729624486

This seems promising! Great work! Any chance there will be an Ollama Modelfile for the masses?

peakji · 2024-10-22T19:18:06 1729624686

GGUF files are available on HF: https://huggingface.co/peakji/steiner-32b-preview-gguf

I haven't personally used an Ollama Modelfile, but I think it should be relatively easy to create one from the GGUF?

ca_tech · 2024-10-23T13:35:49 1729690549

You can now run any GGUF model hosted on Hugging Face using the following command:

ollama run hf.co/{username}/{repository}

Example: ollama run hf.co/peakji/steiner-32b-preview-gguf:Q4_K_M

Source: https://huggingface.co/docs/hub/en/ollama

swyx · 2024-10-22T18:34:37 1729622077

advice to OP - you hurt your own credibility posting on medium dot com. just blog on huggingface or substack or hashnode.

peakji · 2024-10-22T18:55:28 1729623328

I'm new here. Just curious, why avoid Medium? Is it a Hacker News thing, or did I miss something?

whatshisface · 2024-10-22T20:00:59 1729627259

Medium doesn't "hurt your credibility" nearly as much as revealing that one's arsenal of litmus tests is suffering from such a paucity of real knowledge that one bases it on the web design, but Medium has a lot of annoying popups. A lot of people like Substack better and they have a paid subscriber thing that works well.

(realistically speaking, experts tend to know less about the blog hosting ecosystem the more they know about their domain)

swyx · 2024-10-22T20:11:13 1729627873

it's just a "tell" that you don't mind the poor reader experience and being associated with the rest of the low-quality slop that is on medium. many of us here have simply given up clicking on anything medium-related

mdaniel · 2024-10-23T03:31:18 1729654278

For those similarly allergic to medium: https://scribe.rip/@peakji/a-small-step-towards-reproducing-...

nxobject · 2024-10-22T19:29:35 1729625375

As someone without specific background in the subfield (I do embedded programming) – thanks for spelling out what people "in the know" seem to understand about o1's functioning!

zby · 2024-10-22T16:53:40 1729616020

Can it be mixed with the sampling based approaches from optillm (https://github.com/codelion/optillm)?

peakji · 2024-10-22T17:08:55 1729616935

Approaches like best-of-n sampling and majority voting are definitely feasible. But I don't recommend trying anything that adds CoT prompting, as it might interfere with the model's internalized reasoning patterns.
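
For anyone unfamiliar with the two sampling approaches mentioned above, here is a minimal sketch of both. It is not how optillm or Steiner implements them. `sample_answer` and `score` are hypothetical callables standing in for a real model call and a real answer scorer, and `make_fake_sampler` is only a canned stand-in so the example runs without a model.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Sequence


def majority_vote(sample_answer: Callable[[str], str], question: str, n: int = 5) -> str:
    """Sample n final answers and return the most frequent one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [sample_answer(question) for _ in range(n)]
    # most_common(1) returns a one-element list of (answer, count) pairs.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(
    sample_answer: Callable[[str], str],
    score: Callable[[str], float],
    question: str,
    n: int = 5,
) -> str:
    """Sample n answers and return the one with the highest score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=score)


def make_fake_sampler(canned: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: cycles through canned answers."""
    answers = cycle(canned)
    return lambda question: next(answers)


if __name__ == "__main__":
    sampler = make_fake_sampler(["B", "A", "B", "C", "B"])
    print(majority_vote(sampler, "Which option is correct?", n=5))
```

Both functions only look at the final answers, so they leave the model's own reasoning untouched, which is the property that matters here.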

nwnwhwje · 2024-10-22T17:41:47 1729618907

Silly question time.

Is this a fine-tuned LLM, for example a drop-in replacement for Llama etc.?

Or is it some algorithm on top of an LLM, doing some chain of reasoning?

peakji · 2024-10-22T17:47:13 1729619233

It is an LLM fine-tuned using a new type of dataset and RL reward. It's good at reasoning, but I would not recommend it as a replacement for Llama on general tasks.

Mr_Bees69 · 2024-10-22T16:15:28 1729613728

Really hope this goes somewhere; o1 without OpenAI's costs and restrictions would be sweet.

peakji · 2024-10-22T16:32:36 1729614756

The model can already answer some tricky questions that other models (including GPT-4o) have failed to address, achieving a +5.56 improvement on the GPQA-Diamond dataset. Unfortunately, it has not yet managed to reproduce inference-time scaling. I will continue to explore different approaches!

swyx · 2024-10-22T18:38:47 1729622327

not sure i understand the results. it's based on qwen 32b which is 49.49, and your best model is 53.54. results haven't shown that your approach adds significant value yet.

can you compare with just qwen 32b with CoT?

peakji · 2024-10-22T18:49:50 1729622990

The result for Qwen2.5-32B (49.49) is using CoT prompting. Only Steiner models do not use CoT prompting.

More importantly, I highly recommend trying these out firsthand (not only Steiner, but all reasoning models). You'll find that these reasoning models can solve many problems that other models with the same parameter size cannot handle. The existing benchmarks may not reflect this well, as I mentioned in the article:

"... automated evaluation benchmarks, which are primarily composed of multiple-choice questions and may not fully reflect the capabilities of reasoning models. During the training phase, reasoning models are encouraged to engage in open-ended exploration of problems, whereas multiple-choice questions operate under the premise that "the correct answer must be among the options." This makes it evident that verifying options one by one is a more efficient approach. In fact, existing large language models have, consciously or unconsciously, mastered this technique, regardless of whether special prompts are used. Ultimately, it is this misalignment between automated evaluation and genuine reasoning requirements that makes me believe it is essential to open-source the model for real human evaluation and feedback."

swyx · 2024-10-23T05:35:43 1729661743

thanks, congrats on shipping.

ActorNightly · 2024-10-22T17:35:17 1729618517

OpenAI's o1 isn't really going that far though. It's definitely better in some areas, but not better overall.

I'm wondering if we can abstract chain of thought further down into the computation levels to replace a lot of the matrix multiplies. Like smaller transformers with fewer parameters, and more selection of which transformer to use through search.