This "fast and instinctual" behavior is very common in deep learning models. For example...

TeMPOraL · on Dec 9, 2022

> Almost always, classifiers are tricked. We are as well... but only at first glance. Afterward, it is evident that these are innocent images.

I recommend reading to the end and pondering the reveal of the mystery of The Lamp.

This is the closest I've ever seen to an image whose NSFW status flips back and forth purely depending on your "System 2" knowledge.

It also highlights that we're really tackling automated NSFW detection by going after a proxy, not the real thing. The algorithms try to recognize what is depicted in a given image, whereas the true question to ask is whether that image triggers emotions we don't want our audience to experience (arousal for porn, but others, like disgust, for different types of NSFW).

But then, I realize, perhaps it's for the better, because if someone builds an image classifier that detects induced emotions, the ad industry will use it to finally destroy everything that's good in life.