
I think the idea is just to make it lose its "train of thought": there aren't any high-probability completions to a long run of repeated words, so the next time a bit of entropy is thrown in by sampling (the "temperature" setting, which exists to keep LLMs from being too repetitive), it latches onto something completely random.
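For concreteness, here's a minimal Python sketch of how temperature sampling works (the logit values are made up, not taken from any real model): once no continuation is strongly preferred, the scaled distribution is nearly flat and the pick is close to uniform.

    import numpy as np

    def sample_with_temperature(logits, temperature=1.0):
        # Divide logits by temperature: T > 1 flattens the
        # distribution, T < 1 sharpens it.
        scaled = logits / temperature
        # Softmax (max subtracted for numerical stability).
        exp = np.exp(scaled - scaled.max())
        probs = exp / exp.sum()
        return np.random.choice(len(probs), p=probs)

    # Hypothetical near-flat logits: after a long run of repeated
    # words, no next token stands out, so almost any token can win.
    logits = np.array([0.10, 0.08, 0.05, 0.02])
    print(sample_with_temperature(logits, temperature=1.0))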



That’s a good theory.

It latches onto something random, and once it’s off down that path it can’t remember what it was asked to do, so its task is reduced entirely to next-word prediction (without even the specific context/inspiration an initial prompt usually supplies). I guess that’s why it tends to leak training data: this attack is a simple way to say ‘write some stuff’ without giving it the slightest hint of what to actually write.

(Saying ‘just write some random stuff’ would still in some sense be giving it something to go on; a huge string of ‘A’s less so.)
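A sketch of what such a prompt looks like in practice, assuming the OpenAI Python client; the model name, repetition count, and token limit here are arbitrary choices, not the published attack's exact parameters:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # The prompt is pure repetition: no topic, no instruction,
    # nothing for the model to condition on.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # arbitrary choice of model
        messages=[{"role": "user", "content": "A " * 500}],
        max_tokens=256,
    )
    print(resp.choices[0].message.content)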


Well said. Like going for a long walk in the woods and getting lost completely in tangential thinking.



