I suspect ChatGPT is using a form of clean-room design to keep copyrighted material out of the training set of deployed models.
One model is trained on copyrighted works in a jurisdiction where this is allowed and outputs "transformative" summaries of book chapters. Those summaries then serve as training data for the deployed model.
Yup, though a lot of people are acting now as though every already-established principle of fair use needs to be revised suddenly by adding a bunch of "...but if this is done by any form of AI, then it's copyright infringement."
A cover band who plays Beatles songs = great
An artist who paints you a picture in the style of so-and-so = great
An AI who is trained on Beatles songs and can write new ones = exploitative, stealing, etc.
An AI who paints you a picture in the style of so-and-so = get the pitchforks, Big Tech wants to kill art!
This discussion about art "in the style of" being stealing or exploitative didn't start with AI. For quite some time there have been complaints about advertisers commissioning sound-alike tunes to avoid paying licensing fees. AI is only automating it and making it possible on an industrial scale.
Well, I don't know about that. I strongly suspect ChatGPT could deliver whole copyrighted books piece by piece. I suspect that because it most certainly can do that with non-copyrighted text. Just ask it to give you something out of the Bible or Moby Dick. CliffsNotes can't do that.