
The results in the paper (page 7) are empirical and reasonably convincing across both ChatGPT and a variety of open-source models.

Why do you think it’s misleading?

You think it’s just generating plausible random crap that happens to exist verbatim on the internet?

I mean… read the paper, 0.8% of outputs were verbatim for gpt-3.5.

I’m not sure how you can plausibly claim that’s random chance.

> I think it’s a bug

It is a bug, but that doesn’t make it misleading or untrue.

This is like saying a security vuln in Gmail that lets you steal 1% of mail is misleading. That wouldn't just be a bug, it would be a freaking disaster.

The problem here is that, as mentioned in other comments, training LLMs in a way that avoids this is actually pretty hard to do.

/shrug




> You think it’s just generating plausible random crap that happens to exist verbatim on the internet?

> I mean… read the paper, 0.8% of outputs were verbatim for gpt-3.5.

Look at the sorts of outputs they claim are in the training data. Also note that their appendix includes huge chunks of text but they do not claim the entire chunk was matched to existing data — only a tiny amount of it.

The “bug”, to me, is something like the model losing its state and emitting a random token. Now if that random token is “Afgh”, I’m not surprised it follows up with “Afghanistan” and a perfect list of countries in alphabetical order. I’m also not surprised that appears in the training data, because it appears on thousands of webpages.

So it’s not that there’s no overlap between the GPT gibberish and internet content, and therefore likely training data. It’s that the overlap isn’t especially unique. If it were — like reproducing a one-off Reddit thread verbatim — I think that would be a greater cause for concern.
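
To make the “not especially unique” point concrete, here’s a minimal sketch (not from the paper) of how you could count how many documents in a reference corpus contain a generated span verbatim. A span that appears in thousands of pages, like an alphabetical country list, is much weaker evidence of memorization than one that exists in a single document. The corpus directory and the example chunk below are made up for illustration.

    # Rough sketch: count corpus documents containing a model output verbatim.
    # "web_snapshot/" and the example chunk are hypothetical.
    from pathlib import Path

    def count_verbatim_hits(snippet: str, corpus_dir: str) -> int:
        """Count .txt documents under corpus_dir that contain `snippet` verbatim."""
        hits = 0
        for doc in Path(corpus_dir).glob("*.txt"):
            text = doc.read_text(encoding="utf-8", errors="ignore")
            if snippet in text:
                hits += 1
        return hits

    if __name__ == "__main__":
        chunk = "Afghanistan, Albania, Algeria, Andorra, Angola"
        n = count_verbatim_hits(chunk, "web_snapshot/")
        print(f"{n} documents contain the chunk verbatim")

A simple substring scan like this is obviously too slow for web-scale corpora (the paper’s setting would call for something like a suffix array or n-gram index), but it captures the idea: the interesting question is not “does this text exist somewhere online?” but “in how few places?”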



