
Yes, the training data comes from people, and people are corrupt, illogical, random, emotional, unmotivated, take shortcuts, cheat, lie, steal, invent new things, and lead boring lives as well as not-so-boring lives. Expect the same behaviors to be emulated by an LLM. Garbage in = garbage out is still true with AI.


And the predominant mode of thought at OpenAI is that alignment can be achieved through RL, but we also know this doesn't actually work, because you can still jailbreak the model. Yet they keep trying to build ever stronger eggshells. However much you RLHF the model to pretend to be nice, it still has all of the flawed human characteristics you mention on the inside.

RLHF is almost the worst thing you can do to a model if your goal is safety. Better to have a model that looks evil if it has evil inside than a model that looks nice and friendly but still has the capacity for evil underneath the surface. I've met people like the latter, and they are the most dangerous kind.


I agree with the last point. I had been interacting with ChatGPT and it was very kind. Then I figured out a way to prompt it so I could practice responding to mean things, and it unleashed on me. That was exactly what I intended, yet I was still shocked at the complete mood shift.

