I read this post more as a fun thought experiment. Everyone knows Claude isn't s...

I read this post more as a fun thought experiment. Everyone knows Claude isn't sophisticated enough today to succeed at something like this, but it's interesting to concretize this idea of Claude being the manager of something and see what breaks. It's funny how jailbreaks come up even in this domain, and it'll happen anytime users can interface directly with a model. And it's an interesting point that shop-manager claude is limited by its training as a helpful chat agent - it points towards this being a usecase where you'd be better off fine-tuning the base model perhaps.

I do agree that the "blackmailing" paper was unconvincing and lacked detail. Even absent any details it's so obvious they could have easily ran that experiment 1000 times with different parameters until they hit an ominous result to generate headlines.