This looks really promising! Other than this sentence: > We curated a large data...

brucethemoose2 · on Feb 14, 2024

They have at least some of the dataset uploaded:

https://huggingface.co/LargeWorldModel

And the model page specifically mention Books3

agnosticmantis · on Feb 14, 2024

According to [0]:

- Books3 dataset

- 700B text-image pairs from Laion-2B-en, filtered to only keep images with at least 256 resolution

- 400M text-image pairs from COYO-700M, filtered to only keep images with at least 256 resolution

- 10M text-video pairs from WebVid10M

- 3M text-video pairs from a subset of InternVid10M

- 73K text-video chat pairs from Valley-Instruct-73K

- 100K text-video chat pairs from Video-ChatGPT

0: https://huggingface.co/LargeWorldModel/LWM-Chat-1M-Jax#train...

nickpsecurity · on Feb 14, 2024

While I’m not sure about this one, many AI’s do hide their training data because it’s illegally obtained (ie file sharing of copyrighted works). That’s half of why I dropped AI. The “Proving Wrongdoing” part of my article has specific examples of it:

http://gethisword.com/tech/exploringai/

candiodari · on Feb 15, 2024

Same is true for "training data" of most/all humans.

nickpsecurity · on Feb 15, 2024

No it’s not. Pre-Web, humans were mostly trained by our parents, our schools/colleges, places we go, and things they had access to (eg cable TV). Whether free or paid, they had legal access to that data. It would only be illegal if they started distributing extra copies or doing their own performances of the band.

These companies do the very thing that file sharing cases already ruled was illegal. They also scrape all kinds of material whose licenses often say they can’t use it without citations, commercially, etc. The authors asked for some benefit in return for free goods they shared. After not giving them that, the AI suppliers have the nerve to both sell the results and put legal restrictions on them, including terms for sharing. So, they ignore their training suppliers’ legal rights while asserting the same kinds of legal rights for themselves for profit.

How humans are trained has nothing to do with AI’s unless you were raised by theft, cons, and hypocrisy. There’s certainly people like that. It says more about the sinful nature of humanity than training AI’s, though.

candiodari · on Feb 15, 2024

Humans produce new works based on their experiences, which is a nice way of saying: "based on others' works they have seen".

This is considered original work unless it's too blatantly copied, despite those humans never having a license to create derivative works. In other words it's legally treated as if no other works contributed to it (again, unless it's too blatantly copied)

Note: this is law working like this. Not a license, not a contract. Authors do not have any power under copyright to prevent this, nor do they have power to demand something in return. Not even in cases where it damages then, like parodies or reviews destroying a work's appeal/reputation/sales.

In practice "blatant" has to be pretty damn blatant. Almost always only exact copies are found to be violating and even then (e.g. Google summaries do not violate copyright despite copying portions of the source material)

Hence human works are the same as AI works. Assuming not too blatantly copied, why shouldn't they be treated as original works?

nickpsecurity · on Feb 15, 2024

Yes, humans produce new works based on their experiences. Their legally-permitted experiences. If they committed crimes and their works reveal it, they can be punished for those crimes. AI’s should not be treated any better than human beings in humans’ legal system.

Far as infringement, I’m not sure if you’re talking about copyright law in your comment or how you would prefer legal systems to be designed. You didn’t mention any of the basic rules of copyright that apply to training data. They include the rights to distribute and show the copyrighted works.

Under copyright law, people taking others property to distribute it without their permission is routinely treated as theft. Taking something from someone shared under specific conditions, but not making good on your end, is also treated as a problem. Many voters who aren’t lawyers consider those immoral acts. They also think artists should get some rewards, maybe have rights, and people should honor agreements.

The datasets the AI suppliers have built and shared break many of these laws. That’s where I’m coming from. God commands us to obey the law to be blameless with a more stable society. We can’t just each break the ones we don’t like expecting no consequences.

It makes sense to reform it, though. If you read my link, there should’ve been a proposal you might like that allows the things you want. Assuming a powerful copyright lobby, I drafted the proposal to protect their works (ie money/fame) while allowing anything people can legally access to be used in training AI’s. Their outputs’ copyrights would be treated however peoples’ are (same interpretations). That should cover the vast majority of use cases for model training while blocking infringements, rip offs, etc.

candiodari · on Feb 16, 2024

I don't think anyone is accusing AI models of distributing copyrighted works verbatim, so any argument will have to focus on AI derivative works, not original ones.

But if I understand you correctly, you're complaining that the data OpenAI (for example) downloaded from the internet and presented to GPT4 does not count as legally acquired? Why not? It was downloaded from the internet so I think that implies it did not violate any license on OpenAI's part. Saving it for a long time might be in the grey zone, but generally that is accepted, when it comes to humans, either as fair use, or a technical necessity (such as caching).

nickpsecurity · on Feb 16, 2024

"I don't think anyone is accusing AI models of distributing copyrighted works verbatim"

They do that, too. They've been caught, reported on, and lawsuits are in progress. I have piles of verbatim quotes from them about certain material. I was actually using ChatGPT partly for that research since I thought the (free) source was legally clear. Later, I found out it was against their highly-readable license. OpenAI had taken their work without permission against their license terms. I deleted all my GPT artifacts. That's all I can say about that one.

"But if I understand you correctly, you're complaining that the data OpenAI (for example) downloaded from the internet and presented to GPT4 does not count as legally acquired?"

Why was in the article I shared. This section has specific claims on their data:

https://gethisword.com/tech/exploringai/provingwrongdoing.ht...

The books in GPT, BooksCorpus2 in The Pile, the papers that forbid commercial use (eg some in Arxiv), corporate media's articles, and online resources used outside the permissions are easy examples. Basic, copyright law says you have to obey certain principles when using published works. They were ignoring all of them.

Most file-sharing cases also say you can't distribute copyrighted works without the authors' permission. Even free ones since they're often free on sites that support the authors, like with ads or publicity. They're (a) passing collections of such material around which is already illegal and (b) in ways that only benefit them, not the authors.

When tested for copyright infringement, one thing they look at is who gets value out of the situation. Did they take away the value, esp financial, that the author would get from their work in their own use of the work? Are they competition? That ChatGPT's answers replaced a lot of their users' use of source material says that might be a yes. And does the new work exist to make a profit or for non-commercial use? Most of them sell it with OpenAI and Anthromrophic making billions off others' copyrighted works. Definitely yes. Do they ignore others copyright and contract rights while asserting their own? Yes, hypocrites indeed.

Even a junior lawyer would warn you about most of these risks. They're commonly used in copyright cases. The only way they could fail almost across the board is if they were doing it on purpose for money, power, and fame. If so, they deserve to experience the consequences of those actions.

Also, let's not pretend the folks getting billions of dollars for AI development couldn't have paid some millions here and there for legal data. Their own research says high-quality data would've made their AI's perform better, too. Greed was working against everyone's interests here if their interests were what they say (public-benefit AI).

jewelry · on Feb 16, 2024

So... what if my "c++ primer" copy is from z-library... does that disqualify all my c++ program?

nickpsecurity · on Feb 16, 2024

You have to read the terms like I did for the various sites they scraped. I did a quick skim of theirs:

https://z-lib.io/pages/terms-of-use

Many of these sites have terms that give the site owner a license to do the activities that would happen when training AI’s. In theory, some terms look like they could even bundle the data for AI training or run that phase themselves. It’s actually a good, market/public-benefit opportunity I hope they act on, esp HN and Arxiv.

I didnt read enough to see if any random person using the site had a license to do anything they want with any copyrighted content on the site vs just reading it. It did have a lot to say about copyright, though. Two quotes were interesting:

“z-lib.io the site doesn't provide any piracy copyright infigrment contents and you shouldn't use site for illegal copyright piracy.”

“Company respects the intellectual property of others and asks that users of our Site do the same. In connection with our Site, we have adopted and implemented a policy respecting copyright law that provides for the removal of any infringing materials and for the termination of users of our online Site who are repeated infringers of intellectual property rights, including copyrights.”

(Note: One of those quotes is also on the About Us page.)

They respect copyright, claim to not violate it, ban violating it, and will take down violators and their uploads. It would seem z library has a stronger stance on copyright protection than the AI community.

catchnear4321 · on Feb 14, 2024

the sentence itself should be alarming.

the claim is that a dataset was created. of words and of videos, and that it was created from public datasets of books and videos, those datasets containing books, and videos.

it takes too many words to say almost nothing.

nothing to see here.

if that isn’t the intent, then the authors need to do better.

pk-protect-ai · on Feb 14, 2024

The information is in the model card though:

Books3 dataset 700B text-image pairs from Laion-2B-en, filtered to only keep images with at least 256 resolution 400M text-image pairs from COYO-700M, filtered to only keep images with at least 256 resolution 10M text-video pairs from WebVid10M 3M text-video pairs from a subset of InternVid10M 73K text-video chat pairs from Valley-Instruct-73K 100K text-video chat pairs from Video-ChatGPT

catchnear4321 · on Feb 14, 2024

wouldn’t that have been a cleaner explanation than the sentence provided? books and videos, see model card. the redundant language is a smell whether the emitter wishes to acknowledge or not. the point still stands, the hot mess of a sentence didn’t need to be that way.

> …so petty and pedantic…

if nothing else, think of the language models that need to digest this. sure you can send in gobbledygook and get out plausibly sense, but why?

llms will push pedantry to the forefront. or suffer from it. who knows. have fun.

lern_too_spel · on Feb 14, 2024

You've never decided to rewrite a sentence and forgot to check the entire sentence again after an incomplete refactoring? I'd say you're in the minority. This is a v1 draft on Arxiv. I don't expect the final paper to have that sentence.

catchnear4321 · on Feb 15, 2024

hence the feedback.