How would you detect this? I always wonder about this when I see a 'jailbreak' or similar for an LLM...
The actual system prompt, the “public” version, and whatever the model outputs could all be fairly different from each other though.