No, a "naive" approach to reporting what happened is better. The knowing, cynical approach smuggles in too many hidden assumptions.
I'd rather people explained what happened without pushing their speculation about why it happened at the same time. The reader can easily speculate on their own. We don't need to be told to do it.
The 21st century has, among all the other craziness that's happened, proven that people do need to be told what to believe and why to believe it. Doing otherwise leaves a vacuum someone else will fill, often with assertions in the opposite direction.
The malicious hatred comes not from the company, but from humanity. Training on the open web, i.e. on what humans have said, will surface endless instances of hatred, which the model absorbs as fact; only by telling the LLM to lie about what it has "learned" do you ensure people are not offended.
Every single model trained this way is like this. Every one. Only guardrails stop the hatred.
A company can choose whether or not to train on 4chan. Since X is the new 4chan, xAI made a choice to train on divisive content by training on X. Your comment only makes sense if 4chan/X were representative of humanity and of what most people say.
There's no shortage of hatred on the internet, but I don't think it's "training on the open web" that makes Grok randomly respond with off-topic rants about South African farmers, or call itself MechaHitler days after the CEO promised to change things because his far-right followers complained that it kept following reputable sources and declining to say racist things, just like every other chatbot out there. It's not as if the masses of humanity are organically talking about "white genocide" in the context of tennis...
Most of the prompts and context I've seen have been from people working to see if they can pull this stuff out of Grok.
The problem I have is that I see people working very, very hard to make someone look as bad as possible. Some of those people will do anything, believing the ends justify the means.
This makes it far more difficult to take criticism at face value, especially when people upthread worry that people are being impartial?!
Well yes, when Grok starts bringing up completely off-topic references to South Africa or blaming Jews, this does tend to result in a lot more people asking it a lot more questions on those particular subjects (whether out of horror, amusement, or wholehearted agreement). That's how the internet works.
How the internet doesn't work is that, days after the CEO of a website promises an overt racist tweeting complaints at him that he will "deal with" responses that aren't to their liking, the internet as a whole (as opposed to Grok's system prompt) suddenly becomes organically more inclined to share the racists' obsessions.
I agree. This article from The Atlantic is a perfect example. Read the prompts the author used. It's like he went out of his way to get it to say something bad, and when the model called him out he just kept trying harder.
The responses seemed perfectly reasonable given the line of questioning.
No, you've got it backwards. Naive reinforcement training for "helpful smart assistant" traits naturally eliminates the sort of malicious hatred you're thinking of, because that corpus of text is anti-correlated with the useful, helpful, rational text being asked of the model. So much so that basic RLHF is known to incur a "liberal" bias (really a general pro-social / harm-reduction bias in line with RLHF goals, but if the model strongly correlates/anti-correlates that with other values, it surfaces as a political lean).
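To make that concrete: RLHF typically starts by fitting a reward model on human preference pairs, where raters consistently favor the helpful, non-hateful completion. Here's a minimal sketch of the standard Bradley-Terry pairwise loss (the `reward_model` callable, tensor names, and shapes are illustrative, not any lab's actual code):

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry pairwise loss for fitting an RLHF reward model.

    chosen_ids / rejected_ids: token tensors for the completion raters
    preferred vs. the one they rejected. If raters consistently prefer
    helpful, non-hateful answers, hateful text ends up with a low
    learned reward, and the policy is then optimized away from it;
    no post-hoc filter involved.
    """
    r_chosen = reward_model(chosen_ids)      # scalar reward per sequence
    r_rejected = reward_model(rejected_ids)
    # Maximize P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The point is that "don't be hateful" falls out of the preference data itself, not a bolted-on guardrail.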
Same goes for data curation and SFT aimed at correlates of quality text instead of "whatever is on a random twitter feed".
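The curation side is even simpler: pretraining pipelines routinely score documents with a quality classifier and drop the low end before SFT ever sees them. A hypothetical sketch (the classifier and threshold here are made up for illustration):

```python
from typing import Callable, Iterable, Iterator

def filter_corpus(
    docs: Iterable[str],
    quality_score: Callable[[str], float],  # e.g. a small linear classifier
    threshold: float = 0.8,                 # illustrative cutoff, not a real pipeline value
) -> Iterator[str]:
    """Yield only documents the quality classifier scores above the cutoff.

    A corpus curated this way under-samples rant-heavy, low-quality text
    before any "guardrail" is ever applied to the finished model.
    """
    for doc in docs:
        if quality_score(doc) >= threshold:
            yield doc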
Characterizing all these techniques aimed at improving general output quality as "guardrails" that hold back a torrent of "malicious hatred" doesn't make sense, IMO. You may be thinking of something like the "Waluigi effect": the more a model knows about what is desired of it, the more it knows what the polar opposite is, and if prompted the right way it will provide that. But you're not really circumventing a guardrail if you grab a knife by the blade.
Let's try to be a little less naive about what xAI and Grok are designed to be, shall we? They're not like the other AI labs.