DeepSeek-R1 had an RLHF step in their post-training pipeline (section 2.3.4 of their technical report[1]).

In addition, the "reasoning-oriented reinforcement learning" step (section 2.3.2) used an approach that is almost identical to RLHF in theory and implementation. The main difference is that they used a rule-based reward system, rather than a model trained on human preference data.

If you want to train a model like DeepSeek-R1, you'll need to know the fundamentals of reinforcement learning on language models, including RLHF.

[1] https://arxiv.org/pdf/2501.12948



Yes, but these steps were not used in R1-Zero, where the reasoning capabilities were trained.


And as a result R1-Zero is far too crude to be used directly, which is a good indication that RLHF remains relevant.



