No, you've got it backwards. Naive reinforcement training for "helpful smart assistant" traits naturally eliminates the sort of malicious hatred you're thinking of, because that kind of text is anti-correlated with the useful, helpful, or rational text being asked of the model. So much so that basic RLHF is known to induce a "liberal" bias (really a general pro-social / harm-reduction bias in line with RLHF goals, but if the model strongly correlates/anti-correlates that with other values, it reads as a political lean).
Same goes for data curation and SFT aimed at correlates of quality text instead of "whatever is on a random twitter feed".
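To make the "correlates of quality" point concrete, here's a hedged toy sketch (the scoring heuristic, threshold, and corpus are made up for illustration, not anyone's actual curation pipeline): low-effort or hateful text mostly never makes it into the SFT mix in the first place, so there's nothing being "held back" at inference time.

```python
# Toy sketch of "curate for correlates of quality" - illustrative only,
# not any lab's real pipeline. Heuristic and threshold are invented.

def score_quality(text: str) -> float:
    """Crude proxy for quality text: some length, not shouting in all caps."""
    if not text.strip():
        return 0.0
    caps_ratio = sum(c.isupper() for c in text) / len(text)
    length_bonus = min(len(text.split()) / 20.0, 1.0)
    return max(0.0, length_bonus - caps_ratio)

def curate(corpus: list[str], threshold: float = 0.2) -> list[str]:
    """Keep only documents whose quality proxy clears the threshold."""
    return [doc for doc in corpus if score_quality(doc) >= threshold]

raw_corpus = [
    "A short explanation of how gradient descent nudges parameters downhill.",
    "LOL UR ALL WRONG AND STUPID!!!",
]
print(curate(raw_corpus))  # only the first document survives the filter
```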
Characterizing all these techniques aimed at improving general output quality as "guardrails" that hold back a torrent of what would be "malicious hatred" doesn't make sense imo. You may be thinking of something like the "Waluigi effect": the more a model knows what is desired of it, the more it knows what the polar opposite of that is, and if prompted the right way it will provide that. But you're not really circumventing a guardrail if you grab a knife by the blade.