> the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time
As I understand it, it means that if you prompt it with actual context from the particular 42% of the book in question, it completes that prompt with the next 50 tokens of the book at least 50% of the time.
So 50 tokens is not really very much, it's basically a sentence or two. Such a small amount would probably generally fall under fair use on its own. To allege a true copyright violation you'd still need to show that you can chain those together or use some other method to build actual substantial portions of the book. And if it only gets it right 50% of the time, that seems like it would be very hard to do with high fidelity.
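Rough arithmetic on that point (mine, not the paper's): reconstructing a ~500-token passage means chaining ten 50-token hops in a row, each succeeding at 50%:

    P(clean 500-token run) = 0.5^10 ≈ 0.001

so roughly one attempt in a thousand gets through without a single miss.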
Having said all that, what is really interesting is how different the latest Llama 70b is from previous versions. It does suggest that Meta maybe got a bit desperate and started over-training on certain materials that greatly increased its direct recall behaviour.
> So 50 tokens is not really very much, it's basically a sentence or two. Such a small amount would probably generally fall under fair use on its own.
That’s what I was thinking as I read the methodology.
If they dropped the same prompt fragment into Google (or any search engine), how often would they get the next 50 tokens' worth of text returned in the search result summaries?
All this study really says is that models are really good at compressing the text of Harry Potter. You can't get Harry Potter out of it without prompting it with the missing bits - sure, impressively few bits, but is that surprising, considering how many references and fair-use excerpts (like discussion of the story in public forums) it's seen?
There's also the question of how many bits of originality there actually are in Harry Potter. If trained strictly on text up to the publishing of the first book, how well would it compress it?
The alternative here is that Harry Potter is written with sentences that match the typical patterns of English, so when you prompt with a part of the text, the LLM can complete it with above-random accuracy.
Suppose for simplicity that every sentence in the book is 50 tokens or shorter.
According to the stated methodology, I could give the LLM sentence 1 and have a 42% chance of getting sentence 2 recalled. Then I could give it sentence 2 and have a 42% chance of getting sentence 3. Therefore, the LLM contains 42% of the book in some sense.
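That chaining process, as a sketch (complete() is a hypothetical helper standing in for greedy ~50-token generation, not any real API):

    # Feed each recovered chunk back in as the next prompt.
    def chain_extract(first_sentence, n_chunks):
        recovered = [first_sentence]
        prompt = first_sentence
        for _ in range(n_chunks):
            nxt = complete(prompt)  # hypothetical: greedy 50-token continuation
            recovered.append(nxt)   # verbatim with probability ~0.42 per hop
            prompt = nxt
        return " ".join(recovered)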
I disagree this is "not really very much". If a person could do this you would undoubtedly conclude that the person read the book.
In fact the number 42% even understates the severity of the matter. Superficially it makes it sound as if the LLM contains less than half of the book. In reality the process I described applies to 100% of the sentences. Additionally, I'm guessing that in the 58% of cases where the 50 tokens aren't recalled correctly, the output tokens probably have nearly the same meaning as the correct ones.
Except that's not what happened, per the article. Instead, they walked down the logits, which is more like asking someone for their 10-20 best guesses for the next word and, should one of them match the secret answer, telling them which one it is and asking them to go on with the next word. That seems like a substantially easier task, and most of the information comes from the researchers making a choice at every step.
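The difference, sketched out (topk_next_tokens is a hypothetical helper; this is my reading of the procedure, not the paper's code):

    # Greedy recall would require the true token to rank first every time.
    # The logit walk only needs it somewhere in the top k, with the
    # experimenter picking it out and continuing from there.
    def logit_walk_matches(prefix, true_tokens, k=20):
        for t in true_tokens:
            if t not in topk_next_tokens(prefix, k):  # hypothetical helper
                return False
            prefix = prefix + [t]  # experimenter injects the right answer
        return True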
I don’t think this paper proves that, and I don’t think the book is in there in a traditional sense.
It can produce the next sentence or two, but I suspect it can’t reproduce anything like the whole text. If you were to recursively ask for the next 50 tokens, then the first time it’s wrong, the output would probably cease matching, because you fed it not-Harry-Potter.
It seems like chopping Harry Potter up two sentences at a time onto post-its and tossing those in the air. It does contain Harry Potter, in a way, but without the structure is it actually Harry Potter?
Not necessarily. Information is always spread between what we'd normally consider the "storage medium" and the "reader"; how it's split between the two is a controllable parameter.
Consider e.g.:
- The decimal expansion of pi, taken to sufficiently many places, contains both fragments of the work and the work in full. The trick is you have to know where to find it - and it's that knowledge that's actually equivalent to the work itself.
- Any kind of compression that uses a dictionary separate from the compressed artifact shifts some of the information into the dictionary file - or, if it's a common dictionary, into the compressor/decompressor itself (see the sketch below).
In the case from the study, the experimenter actually has to supply most of the information required to pull Harry Potter out of the model - they need to make specific prompts with quotes from the book, and then observe which logits correspond to the actual continuation of those quotes. The experimenter is doing information-loaded selection multiple times: at prompting, and at identifying logits. This by itself doesn't really prove the model memorized the book, only that it saw fragments of it - in cases where those fragments are book-specific (e.g. using proper names from the HP world) rather than generic English sentences.
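The dictionary point is easy to demo with zlib's preset-dictionary support (my example, nothing to do with the study):

    import zlib

    text = b"Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much."

    plain = zlib.compress(text)

    # Preset dictionary: the decompressor must already hold `text`.
    co = zlib.compressobj(zdict=text)
    with_dict = co.compress(text) + co.flush()

    print(len(plain), len(with_dict))  # with_dict is tiny: the information
                                       # moved into the shared dictionary

    # Decompression fails without the same dictionary on the other side:
    assert zlib.decompressobj(zdict=text).decompress(with_dict) == text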
Yeah... it's going to depend on how the issue is framed. However, a "copy" of something where there is no way to practically extract the original from it probably has a pretty good argument that it's not really a "copy". For example, a regular dictionary probably has 99% of Harry Potter in it. Is it a copy?
I'd say no. More than half of as-yet unwritten books will be "in there" too, because I bet it will compress the text of a freshly published book much better than 50% (and newer models could even compress new books to a fiftieth of their size, which is more like what that 1-in-50-tokens figure suggests).
That seems like a reasonably easy test to run, right? All you need is a bit of prose known not to have existed beforehand. Actually, the experiment could be run using the paper itself!
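A sketch of that test using per-token log-probs from a local Hugging Face model (the model name is just an example; any causal LM you have access to works):

    import math
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    name = "meta-llama/Llama-3.1-8B"  # example; gated, needs license acceptance
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    def bits_per_byte(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # log-probability the model assigns to each actual next token
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        nll = -logprobs.gather(1, ids[0, 1:, None]).sum().item()
        return nll / math.log(2) / len(text.encode("utf-8"))

    # Compare a pre-cutoff book excerpt against freshly written prose:
    # memorized text should come out at far fewer bits per byte.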
Almost the entire book is in there. From the paper: if you give it a 100-token prompt, it will produce the next 50 tokens with more than 1% probability for spans covering 91% of the book. And as the title says, it also produces the next 50 tokens with more than 50% probability for spans covering 42% of the book. I bet it gets close to 100% as you lower the probability threshold.
Also, they went through the book in 10-token strides. It's a bit of a tortured way to reproduce the book (basically impossible to actually reproduce it this way), but it shows that the content is in there.
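For what "cover" means here, roughly (my reading of the setup, not the authors' code):

    def coverage(span_probs, threshold, span=50, stride=10):
        # span_probs[i]: model's probability of reproducing the 50-token
        # span starting at position i*stride, given the preceding tokens
        covered = set()
        for i, p in enumerate(span_probs):
            if p > threshold:
                covered.update(range(i * stride, i * stride + span))
        total = (len(span_probs) - 1) * stride + span
        return len(covered) / total  # fraction of the book's tokens covered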
Now whether this is derivative work, copyright violation, or whatever is debatable. It probably gets similar numbers for a bunch of other books too. They should have done the Bible - they'd probably get way higher numbers, but that wouldn't go viral.
I think I agree with this take. The book is in there in some sense, whether or not it is a copyright violation is debatable.
Honestly, I get why these debates happen - it is practical to establish whether or not this emerging tech is illegal under current law. But it’s also like… well, obviously current law wasn’t written with this sort of application in mind.
Whether or not we think LLMs are basically good or bad, they are clearly quite impactful. It would be a nice time to have a functional legislature to address this directly.
If I give you a vision algorithm that, given every other frame of a Harry Potter movie, can accurately predict the interstitials - would you say that half that Harry Potter movie is "in" it?
Congratulations, you've just invented a video codec with motion estimation. The Moving Picture Experts Group wants its share via some bullshit royalties/patents, though - better pay up, because they are very litigious and won't go soft on you just because you're not a big tech corporation :)
An LLM is not a database. There is no significant amount of information in a model that can be accessed 100% of the time. This is because it's a mystery to the user what collection of tokens will lead to a specific output. To get a predictable result from an LLM 50% of the time is very significant.
This doesn't tell us for certain whether or not the model was trained on a full copy of the book. It's possible that 50-token long passages from 42% of the book were, incidentally, quoted verbatim in various parts of the training data. Considering the popularity of both the book itself, and derivative fan-fiction, I would not be surprised. I would be less surprised to learn that it was indeed trained on a full copy of the book, if not several.
The more meaningful point here is that the ability to reproduce half a book is the same sort of overt derivative work that is definitely considered copyright infringement in other circumstances. A lossy copy is still a copy. If we are to hold LLMs to the same standard as other content, this isn't very easy to defend.
Personally, I see this as a good opportunity to reevaluate copyright on the whole. I think we would be better off without it.
Germany does not have something called "fair use," but it does have provisions for uses that are fair. For example your use of the three words to talk about their copyrighted status is perfectly legal in Germany. That somebody wasn't allowed to use them in a specific way in the past doesn't mean that nobody is allowed to use them in any way.
To be pedantic, short quotes (as opposed to short copied fragments that are not used as quotes) are explicitly one of the allowed uses (Zitierbefugnis). You can even quote entire works "in an independent scientific work for the purpose of explaining its content"! https://www.gesetze-im-internet.de/englisch_urhg/englisch_ur...
Generally speaking, exceptions to copyright are based on the appropriateness of the amount of copied content for the given allowed use, so the shorter it is, the more likely it is for copying to be permitted. European copyright law isn't much different from fair use in that respect.
Where it does differ is that the allowed uses are more explicitly enumerated. So Meta would have to argue e.g. based on the exception for scientific works specifically, rather than more general principles.
> 50 tokens is not really very much
Yes! And also llama3.1’s tokens are different from Qwen and llama1 tokens. That was the first generation where Meta started to use a very large vocab_size.
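Easy to check, assuming you have access to the (gated) tokenizers:

    from transformers import AutoTokenizer

    for name in ["meta-llama/Llama-2-7b-hf", "meta-llama/Llama-3.1-8B"]:
        print(name, AutoTokenizer.from_pretrained(name).vocab_size)
    # Llama 2: 32,000; Llama 3.x: 128,256 - so a Llama 3 token tends to
    # span more text, and "50 tokens" is a longer excerpt than it used to be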
Except the odds of that happening by chance even 50% of the time are lower than winning the lottery multiple times. All while ingesting copyrighted material without the consent (and presumably against the wishes) of the copyright holder.
To clarify, they look at the probability that a model will produce a verbatim 50-token excerpt given the preceding 50 tokens. They evaluate this for all sequences in the book using a sliding window of 10 characters (NB: not tokens). Sequences from Harry Potter have substantially higher probabilities of being reproduced than sequences from less well-known books.
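Concretely, the probability of a verbatim continuation is just the product of per-token conditional probabilities; a sketch (next_token_logprob is a hypothetical helper, not the authors' code):

    import math

    def excerpt_prob(prefix_ids, target_ids):
        total = 0.0
        ids = list(prefix_ids)
        for t in target_ids:
            total += next_token_logprob(ids, t)  # hypothetical: log P(t | ids)
            ids.append(t)
        return math.exp(total)  # counts as "memorized" here if > 0.5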
Whether this is "recall" is, of course, one of those tricky semantic arguments we have yet to settle when it comes to LLMs.
> one of those tricky semantic arguments we have yet to settle when it comes to LLMs
Sure. But imagine this: In a hypothetical world where LLMs never ever exist, I tell you that I can recall 42 percent of the first Harry Potter book. What would you assume I can do?
It's definitely not "this guy can predict the next 10 characters with 50% accuracy."
Of course, the semantics of 'recall' aren't the point of this article. The point is that Harry Potter was in the training set. But I still think it's a nothingburger. It would be very weird to assume Llama was trained on copyright-free materials only. And AFAIK there isn't a legal precedent saying training on copyrighted materials is illegal.