> the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time
As I understand it, it means that if you prompt it with actual context from the particular 42% of the book in question, it completes that prompt with the next 50 tokens of the book at least 50% of the time.
So 50 tokens is not really very much, it's basically a sentence or two. Such a small amount would probably generally fall under fair use on its own. To allege a true copyright violation you'd still need to show that you can chain those together or use some other method to build actual substantial portions of the book. And if it only gets it right 50% of the time, that seems like it would be very hard to do with high fidelity.
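Rough arithmetic on that point (mine, not the paper's): reconstructing a ~500-token passage means chaining ten 50-token hops in a row, each succeeding at 50%:

    P(clean 500-token run) = 0.5^10 ≈ 0.001

so roughly one attempt in a thousand gets through without a single miss.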
Having said all that, what is really interesting is how different the latest Llama 70b is from previous versions. It does suggest that Meta maybe got a bit desperate and started over-training on certain materials that greatly increased its direct recall behaviour.
> So 50 tokens is not really very much, it's basically a sentence or two. Such a small amount would probably generally fall under fair use on its own.
That’s what I was thinking as I read the methodology.
If they dropped the same prompt fragment into Google (or any search engine), how often would they get the next 50 tokens' worth of text returned in the search result summaries?
All this study really says is that models are really good at compressing the text of Harry Potter. You can't get Harry Potter out of it without prompting it with the missing bits - sure, impressively few bits, but is that surprising, considering how many references and fair-use excerpts (like discussion of the story in public forums) it's seen?
There's also the question of how many bits of originality there actually are in Harry Potter. If trained strictly on text up to the publishing of the first book, how well would it compress it?
The alternative here is that Harry Potter is written with sentences that match the typical patterns of English, so when you prompt with a part of the text, the LLM can complete it with above-random accuracy.
Suppose for simplicity that every sentence in the book is 50 tokens or shorter.
According to the stated methodology, I could give the LLM sentence 1 and have a 42% chance of getting sentence 2 recalled. Then I could give it sentence 2 and have a 42% chance of getting sentence 3. Therefore, the LLM contains 42% of the book in some sense.
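That chaining process, as a sketch (complete() is a hypothetical helper standing in for greedy ~50-token generation, not any real API):

    # Feed each recovered chunk back in as the next prompt.
    def chain_extract(first_sentence, n_chunks):
        recovered = [first_sentence]
        prompt = first_sentence
        for _ in range(n_chunks):
            nxt = complete(prompt)  # hypothetical: greedy 50-token continuation
            recovered.append(nxt)   # verbatim with probability ~0.42 per hop
            prompt = nxt
        return " ".join(recovered)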
I disagree this is "not really very much". If a person could do this you would undoubtedly conclude that the person read the book.
In fact the number 42% even understates the severity of the matter. Superficially it makes it sound as if the LLM contains less than half of the book. In reality the process I described applies to 100% of the sentences. Additionally, I'm guessing that in the 58% of cases where the 50 tokens aren't recalled correctly, the output tokens probably have nearly the same meaning as the correct ones.
Except that's not what happened, per the article. Instead, they walked down the logits, which is more like asking someone for their 10-20 best guesses for the next word and, should one of them match the secret answer, telling them which one it is and asking them to go on with the next word. That seems like a substantially easier task, and most of the information comes from the researchers making a choice at every step.
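The difference, sketched out (topk_next_tokens is a hypothetical helper; this is my reading of the procedure, not the paper's code):

    # Greedy recall would require the true token to rank first every time.
    # The logit walk only needs it somewhere in the top k, with the
    # experimenter picking it out and continuing from there.
    def logit_walk_matches(prefix, true_tokens, k=20):
        for t in true_tokens:
            if t not in topk_next_tokens(prefix, k):  # hypothetical helper
                return False
            prefix = prefix + [t]  # experimenter injects the right answer
        return True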
I don’t think this paper proves that, and I don’t think the book is in there in a traditional sense.
It can produce the next sentence or two, but I suspect it can’t reproduce anything like the whole text. If you were to recursively ask for the next 50 tokens, then the first time it’s wrong, the output would probably cease matching, because you fed it not-Harry-Potter.
It seems like chopping Harry Potter up two sentences at a time onto post-its and tossing those in the air. It does contain Harry Potter, in a way, but without the structure is it actually Harry Potter?
Not necessarily. Information is always spread between what we'd normally consider the "storage medium" and the "reader"; how it's split between the two is a controllable parameter.
Consider e.g.:
- The decimal expansion of pi, taken to sufficiently many places, contains both fragments of the work and the work in full. The trick is you have to know where to find it - and it's that knowledge that's actually equivalent to the work itself.
- Any kind of compression that uses a dictionary separate from the compressed artifact shifts some of the information into the dictionary file - or, if it's a common dictionary, into the compressor/decompressor itself (see the sketch below).
In the case from the study, the experimenter actually has to supply most of the information required to pull Harry Potter out of the model - they need to make specific prompts with quotes from the book, and then observe which logits correspond to the actual continuation of those quotes. The experimenter is doing information-loaded selection multiple times: at prompting, and at identifying logits. This by itself doesn't really prove the model memorized the book, only that it saw fragments of it - in cases where those fragments are book-specific (e.g. using proper names from the HP world) rather than generic English sentences.
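The dictionary point is easy to demo with zlib's preset-dictionary support (my example, nothing to do with the study):

    import zlib

    text = b"Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much."

    plain = zlib.compress(text)

    # Preset dictionary: the decompressor must already hold `text`.
    co = zlib.compressobj(zdict=text)
    with_dict = co.compress(text) + co.flush()

    print(len(plain), len(with_dict))  # with_dict is tiny: the information
                                       # moved into the shared dictionary

    # Decompression fails without the same dictionary on the other side:
    assert zlib.decompressobj(zdict=text).decompress(with_dict) == text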
Yeah... it's going to depend on how the issue is framed. However, a "copy" of something where there is no way to practically extract the original from it probably has a pretty good argument that it's not really a "copy". For example, a regular dictionary probably has 99% of Harry Potter in it. Is it a copy?
I'd say no. More than half of as-yet unwritten books will be "in there" too, because I bet it will compress the text of a freshly published book much better than 50% (and newer models could even compress new books to a fiftieth of their size, which is more like what that 1-in-50-tokens figure suggests).
That seems like a reasonably easy test to run, right? All you need is a bit of prose known not to have existed beforehand. Actually, the experiment could be run using the paper itself!
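A sketch of that test using per-token log-probs from a local Hugging Face model (the model name is just an example; any causal LM you have access to works):

    import math
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    name = "meta-llama/Llama-3.1-8B"  # example; gated, needs license acceptance
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    def bits_per_byte(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # log-probability the model assigns to each actual next token
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        nll = -logprobs.gather(1, ids[0, 1:, None]).sum().item()
        return nll / math.log(2) / len(text.encode("utf-8"))

    # Compare a pre-cutoff book excerpt against freshly written prose:
    # memorized text should come out at far fewer bits per byte.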
Almost the entire book is in there. From the paper: if you give it a 100-token prompt, it will produce the next 50 tokens with more than 1% probability for spans covering 91% of the book. And as the title says, it also produces the next 50 tokens with more than 50% probability for spans covering 42% of the book. I bet it gets close to 100% as you lower the probability threshold.
Also, they went through the book in 10-token strides. It's a bit of a tortured way to reproduce the book (basically impossible to actually reproduce it this way), but it shows that the content is in there.
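For what "cover" means here, roughly (my reading of the setup, not the authors' code):

    def coverage(span_probs, threshold, span=50, stride=10):
        # span_probs[i]: model's probability of reproducing the 50-token
        # span starting at position i*stride, given the preceding tokens
        covered = set()
        for i, p in enumerate(span_probs):
            if p > threshold:
                covered.update(range(i * stride, i * stride + span))
        total = (len(span_probs) - 1) * stride + span
        return len(covered) / total  # fraction of the book's tokens covered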
Now whether this is derivative work, copyright violation, or whatever is debatable. It probably gets similar numbers for a bunch of other books too. They should have done the Bible - they'd probably get way higher numbers, but that wouldn't go viral.
I think I agree with this take. The book is in there in some sense, whether or not it is a copyright violation is debatable.
Honestly, I get why these debates happen - it is practical to establish whether or not this emerging tech is illegal under current law. But it’s also like… well, obviously current law wasn’t written with this sort of application in mind.
Whether or not we think LLMs are basically good or bad, they are clearly quite impactful. It would be a nice time to have a functional legislature to address this directly.
If I give you a vision algorithm that, given every other frame of a Harry Potter movie, can accurately predict the interstitials - would you say that half that Harry Potter movie is "in" it?
Congratulations, you've just invented a video codec with motion estimation. The Moving Picture Experts Group wants its share via some bullshit royalties/patents, though - better pay up, because they are very litigious and won't go soft on you just because you're not a big tech corporation :)
An LLM is not a database. There is no significant amount of information in a model that can be accessed 100% of the time. This is because it's a mystery to the user what collection of tokens will lead to a specific output. To get a predictable result from an LLM 50% of the time is very significant.
This doesn't tell us for certain whether or not the model was trained on a full copy of the book. It's possible that 50-token long passages from 42% of the book were, incidentally, quoted verbatim in various parts of the training data. Considering the popularity of both the book itself, and derivative fan-fiction, I would not be surprised. I would be less surprised to learn that it was indeed trained on a full copy of the book, if not several.
The more meaningful point here is that the ability to reproduce half a book is the same sort of overt derivative work that is definitely considered copyright infringement in other circumstances. A lossy copy is still a copy. If we are to hold LLMs to the same standard as other content, this isn't very easy to defend.
Personally, I see this as a good opportunity to reevaluate copyright on the whole. I think we would be better off without it.
Germany does not have something called "fair use," but it does have provisions for uses that are fair. For example your use of the three words to talk about their copyrighted status is perfectly legal in Germany. That somebody wasn't allowed to use them in a specific way in the past doesn't mean that nobody is allowed to use them in any way.
To be pedantic, short quotes (as opposed to short copied fragments that are not used as quotes) are explicitly one of the allowed uses (Zitierbefugnis). You can even quote entire works "in an independent scientific work for the purpose of explaining its content"! https://www.gesetze-im-internet.de/englisch_urhg/englisch_ur...
Generally speaking, exceptions to copyright are based on the appropriateness of the amount of copied content for the given allowed use, so the shorter it is, the more likely it is for copying to be permitted. European copyright law isn't much different from fair use in that respect.
Where it does differ is that the allowed uses are more explicitly enumerated. So Meta would have to argue e.g. based on the exception for scientific works specifically, rather than more general principles.
> 50 tokens is not really very much
Yes! And also llama3.1’s tokens are different from Qwen and llama1 tokens. That was the first generation where Meta started to use a very large vocab_size.
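Easy to check, assuming you have access to the (gated) tokenizers:

    from transformers import AutoTokenizer

    for name in ["meta-llama/Llama-2-7b-hf", "meta-llama/Llama-3.1-8B"]:
        print(name, AutoTokenizer.from_pretrained(name).vocab_size)
    # Llama 2: 32,000; Llama 3.x: 128,256 - so a Llama 3 token tends to
    # span more text, and "50 tokens" is a longer excerpt than it used to be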
Except the odds of that happening by chance even 50% of the time are lower than winning the lottery multiple times. All while ingesting copyrighted material without the consent (and presumably against the wishes) of the copyright holder.
To clarify, they look at the probability that a model will produce a verbatim 50-token excerpt given the preceding 50 tokens. They evaluate this for all sequences in the book using a sliding window of 10 characters (NB: not tokens). Sequences from Harry Potter have substantially higher probabilities of being reproduced than sequences from less well-known books.
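Concretely, the probability of a verbatim continuation is just the product of per-token conditional probabilities; a sketch (next_token_logprob is a hypothetical helper, not the authors' code):

    import math

    def excerpt_prob(prefix_ids, target_ids):
        total = 0.0
        ids = list(prefix_ids)
        for t in target_ids:
            total += next_token_logprob(ids, t)  # hypothetical: log P(t | ids)
            ids.append(t)
        return math.exp(total)  # counts as "memorized" here if > 0.5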
Whether this is "recall" is, of course, one of those tricky semantic arguments we have yet to settle when it comes to LLMs.
> one of those tricky semantic arguments we have yet to settle when it comes to LLMs
Sure. But imagine this: In a hypothetical world where LLMs never ever exist, I tell you that I can recall 42 percent of the first Harry Potter book. What would you assume I can do?
It's definitely not "this guy can predict the next 10 characters with 50% accuracy."
Of course, the semantics of 'recall' aren't the point of this article. The point is that Harry Potter was in the training set. But I still think it's a nothingburger. It would be very weird to assume Llama was trained on copyright-free materials only. And AFAIK there isn't a legal precedent saying training on copyrighted materials is illegal.