Hacker News new | past | comments | ask | show | jobs | submit login

This is why, whenever I can, I call RLHF/DPO "sequence level calibration" instead of "alignment tuning".

Some precursors to RLHF: https://arxiv.org/abs/2210.00045 https://arxiv.org/abs/2203.16804




Consider applying for YC's Summer 2025 batch! Applications are open till May 13

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: