How to fix “AI’s original sin”

kragen · on June 21, 2024

i think plausibly being able to use youtube video as training data was the major reason for google to buy youtube in the first place. i'd be very surprised if youtube terms of service actually prohibit google from doing this

also, while a lot of tim's thoughts are excellent, i strongly disagree with this part

> When someone reads a book, watches a video, or attends a live training, the copyright holder gets paid

reading books, watching videos, and attending trainings or other performances are not rights reserved to copyright holders, and indeed the history of copyright law carefully and specifically excludes such activities from requiring copyright licenses. consequently copyright holders do not in fact get paid for them. the first sale doctrine means that, in the usa (where the nyt has filed their lawsuit), not only can copyright holders not charge people for reading books and watching videos, they can't even charge them from reselling used books and videos

this is fundamental to the freedom of thought and inquiry that underlie liberal civilization; it's not a minor detail

solresol · on June 21, 2024

> i think plausibly being able to use youtube video as training data was the major reason for google to buy youtube in the first place

When I asked Eric Schmidt the "why did we spend so much money on buying youtube?" question his answer was "if it's the future of television, it was a bargain; if not, we overpaid."

There didn't seem to be any expectation among senior management at the time that it was anything other than "televisions carry advertisements, we want in on that market."

kragen · on June 21, 2024

thank you, my only interaction ever with eric was when he ate lunch on my wife

pentae · on June 21, 2024

Was this a Nyotaimori situation or did you mean "with my wife"

kragen · on June 21, 2024

we were all in a pit full of plastic balls, she was not visible

Izkata · on June 22, 2024

Don't forget they also failed to gain any traction with Google Video before buying Youtube.

numpad0 · on June 21, 2024

> plausibly ... training data was the major reason for google to buy youtube

I'd agree, and I'd also argue people were totally cool with that until LLM/GenAI happened.

Somehow it's cool and exciting if you fed YouTube data to reconstruct historical artifacts, prototyped self driving car software, trained super-resolution algorithm, so on, but not GenAI. It's a different thing altogether. It's a double standard, or at least a set of criteria with a hidden decisive criterion.

Just IMO, I think that "double" standard has to be discussed more. It's supposedly about copyright but something is off, and it's definitely not about monetary compensation(individual works of art nor collective income support). There's something else with GenAI/LLM that make people want it gone.

e: anecdotal datapoint that people were cool about AI until LLM/GenAI/OpenAI[1] - no talks of safety, training data provenance, societal harm, nothing negative whatsoever from a digital camera news-blog - and it's about a Diffusion model:

  Enhance! Google researchers detail new method for upscaling low-resolution images with impressive results
  Published Aug 30, 2021 | Gannon Burgett
  [...] Or is it? A new blog post on the Google AI Blog showcases a new technology its developed to upscale low-resolution images with incredible results.

1: https://www.dpreview.com/news/0501469519/google-researchers-...

tubignaaso · on June 21, 2024

Could it be a property of the transformative nature of those non-GenAI models? Using the data to create self driving systems or enhance existing works is adding value to the pool of work. It takes the copy written data and creates something new. GenAI, by comparison, seems to devalue existing works. It takes the same data and creates competing works at best, straight up copies at worst.

numpad0 · on June 22, 2024

That's one highly plausible possibility, but I also think they could be something else e.g. people just not liking AI aesthetics.

robertlagrant · on June 21, 2024

> There's something else with GenAI/LLM that make people want it gone.

Generative AI is in the news a lot right now because clicks. "AI will take your job" is out there a lot, possibly because "writer of low resolution news articles" is what it probably could replace, and so the writers have it on their minds.

deelowe · on June 21, 2024

I doubt using it as training data was specifically the goal, but Google has always believed more data = more profit over the long term. This is why Gmail launched with unlimited storage.

Semaphor · on June 21, 2024

Was it unlimited? I only remember it being a decently high number at the time, far higher than any other freemailer.

palad1n · on June 21, 2024

I remember that Gmail was released on April 1st (on purpose), and many people thought it was a joke because it came with 1GB of storage, while places like Yahoo had like 20MB.

lolinder · on June 21, 2024

It technically had a limit, but I remember thinking of it as unlimited because the amount of storage available counted up at a faster rate than I was saving emails.

deelowe · on June 21, 2024

Maybe you're right. It's been a while. Either way, the philosophy was always to make money off the data somehow even if we didn't know how at the time.

TheDudeMan · on June 21, 2024

Correct. 1GB.

lxgr · on June 21, 2024

“And counting”, famously! I remember watching that counter crawl up and up in disbelief, when Hotmail and local alternatives were offering at most 10MB on their free plans.

ajkjk · on June 21, 2024

No way, AI wasn't on the radar back then.

dannyobrien · on June 21, 2024

At EFF we were arguing with Google about their permanent collection of user data early (I joined in 2005, and we already were putting pressure on them then). Whenever we asked, said they were sure it would come in useful for improving their services. Google just institutionally strongly believed in the value of data.

I sometimes wonder if the form of our current machine-learning boom is actually based on that conviction and the determined search for applications, rather than modern AI being a vindication of that strategy. A bit like Moore's Law: is it an iron rule of technology, or just a way to coordinate a huge amount of resources across an industry?

Kye · on June 21, 2024

It was a long time ago that I read a history on this, and I might be missing a detail, but the gist was Google's investors were clamoring for profits following the .com crash and Google realized the data they had was a gold mine if they could just figure out how to apply ML to it.

They tried really hard and did okay for a while using it for advertising, but Doubleclick did it better, so they bought it in 2008.

AI (ML) was absolutely on the radar.

kragen · on June 21, 2024

google has been an ai company since way before ai was fashionable. they hired norvig before i first met him in 02001. you can find dekhn comments here about larry and sergey talking with him about the central importance of ai back last millennium. i've also heard it myself from other early googlers (though not larry and sergey)

dymk · on June 21, 2024

Occam's razor. Google has always been an ads company. YouTube had big ad potential. Saying it was for the training data for AI that wouldn't exist until over a decade later ignores the obvious.

kragen · on June 21, 2024

google was an ads company before they bought youtube but they were an ai company before they were an ads company

wrs · on June 21, 2024

I’m having a hard time understanding Tim’s point here. He seems to have retrieval and generation confused, or combined, or something.

Of course in the retrieval (R in RAG) you can do attribution of the source material: you’re just doing a search and bringing up a literal excerpt of something from your database. You know exactly where you got it.

But then for generation (G) you hand that excerpt to the LLM, and the only reason the model can “understand” it is the vastly larger corpus of text you trained the model on, whose origin is, due to the very nature of LLMs, smeared out of existence. That is the controversial aspect that (AFAIK) has no technical solution at present.

He seems to imply that because attributing the R part is easy, attributing the G part should be too, but those are completely independent problems. Not to mention that you don’t have to do a retrieval step to generate things in the first place; the LLM alone can do that.

The part about doing a retrieval on the output to see if it’s similar to something else is at least technically possible, but he handwaves past the problem of what on earth you’re supposed to do if you find something. YouTube doesn’t do a great job (e.g., people getting copyright strikes from their own performances of public domain works) and it at least has an unambiguous set of things to search for.

antihipocrat · on June 21, 2024

My read on it was that the terms of use of content services can prevent that content being used to train LLMs.

If LLMs were trained using data from these services then it's possible that this will be challenged in court. Should the court rule in favour of the content services, then the LLMs may need to be re-trained (or likely negotiate compensation).

CaptainFever · on June 21, 2024

Related case: https://en.m.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

wrs · on June 21, 2024

Hm, I thought the overall goal was that you would train LLMs on that data, but the owners of the data would be compensated when output was generated that was influenced by it.

Somehow we have to be able to train LLMs on high-quality information, without having the resulting generative capability destroy the economic support for creating that information in the first place.

neilwilson · on June 21, 2024

> When someone reads a book, watches a video, or attends a live training, the copyright holder gets paid

That's what copyright holders would like - hence why they try to restrict the licences so much these days.

It isn't the case. Many individuals can read, watch and listen to a work where only one payment has been made.

But really all the AI bots are doing is reading, watching and listening to the content and remembering it in pretty much the same way as a person does.

Is an artist who has listened to a bazillion blues numbers and then constructs a new song based upon what they have heard and ingested in violation of copyright?

If not, then neither is AI. It's just a probability matrix, not a replica.

If the AI people have paid the correct fee to listen, watch or read the material and then sell what they remember it is no different from any trained professional. The contract has been fulfilled.

The problem we have is that copyright holders have been dining off past glories for too long, and have gained the power to embed that rent for way longer than is sensible. What AI does is make those copyrights rot faster as it has a better memory than most people. That's good for society, because it forces copyright holders to produce more new stuff if they want to maintain their income.

A rebalancing away from rent seekers towards regular producers would be good for everybody.

CaptainFever · on June 21, 2024

> If the AI people have paid the correct fee to listen, watch or read the material and then sell what they remember it is no different from any trained professional. The contract has been fulfilled.

I'd like to point out that this is indeed how it works legally in Singapore already. If you have legal access to a copy of the material, you can do various things such as make backup copies, and yes, train AI on it. This excludes pirated or hacked materials. And the owner is not allowed to stop you by using a TOS.

It's the same for the EU, but over there they allow TOS exclusions for commercial training.

nicbou · on June 21, 2024

In your last paragraph, are artists rent seekers and large companies that train AI producers? This is not how I feel when Google inserts its AI between my website and its readers.

CaptainFever · on June 21, 2024

What if we flip the dynamics: a hobbyist training a LoRA on Disney movies? This is just an appeal to... well, "big = bad".

I believe what GP meant by rent-seekers are copyright holders who expect to own information for a hundred years, including all derivatives of it. It could be Disney, it could be an "original character do not steal" artist, and it could be "don't train on our AI" OpenAI. While regular producers would be those who use free licenses, or at least are OK with derivatives.

nicbou · on June 21, 2024

OP makes a fair legal argument, but if anything it might mean that the laws are not in line with reality. Laws are made for man, not the other way around.

From my perspective as someone who lives from putting information on the internet, all of these people are rent seekers. By that I mean that they insert themselves between me and my audience to derive a profit. And they deny me income and credit in the process. I create the bulk of the value and put in the bulk of the labour. They swoop in and take everything.

It's especially frustrating that creators are still expected to create things and feed the AI that claims to replace them. It's a slap in the face that not only they strip you of an audience, an income or even credit, but they also expect you to keep creating without any of that.

The problem with the "it's learning just like a human" argument is that it never accounts for the superhuman scale at which it's done. Humans rarely read every single book, and they rarely repeat the information to billions.

Put plainly, I don't want to be the anonymous labour behind another person's profit. No legal argument will make this sort of future attractive to me.

YeGoblynQueenne · on June 21, 2024

Damn right. Put another way: a language model only needs to be fed data (and compute) but a human needs to eat actual food to be able to generate that data in the first place. So if big tech corps want their models trained, then should pay the people who generate the data.

Wtf is it with people asking how they get to make money if they're not allowed to take away all of yours?

radarsat1 · on June 21, 2024

So, I generally agree with you, but my question to you is how do I square that with thinking this technology is legitimately amazing and I want to see more of it and use it? Do you forsee any future in which creators are happy and this technology gets to exist?

You mention getting paid, this might be part of it, but (a) how do you manage payments when so many sources contribute such a small proportion of the input and (b) would this actually solve the problem or would creators continue to object despite getting paid according to their representation in the dataset?

I imagine that you will answer that (a) is "their problem" so feel free to focus on (b).

nicbou · on June 22, 2024

My main gripe is that we were never given the chance to opt out. This is where I would start.

I don't think that there's a perfect solution and I agree that the technology is amazing. I just wish that we didn't deny the impact of taking work from creators while denying them the rewards.

radarsat1 · on June 22, 2024

I guess part of my issue with the argument is that creators seem to want to have their cake and eat it too. They want to put their work out into the public, but then they want to also control what people can do with it. These two themes, although I can see the need and logic for them, seem to be totally antithetical on a practical basis.

In some sense art creation is, indeed, a risk.. you put something out there, and then you hope it is perceived well, and hope that it isn't abused. That's what makes making art such a difficult thing.. you are putting something out there, a piece of yourself.. but not acknowledging the associated risks, that someone might hate it, that someone might not understand it, or indeed that someone might "sample" your work into a new context, is being totally disingenuous with yourself as an artist.

Of course I have no issue with the idea that artists should be fairly compensated, and the whole legal idea of copyright exists exactly for that, and I'm incredibly interested to see how this whole thing plays out in the legal sense. But I have to say that I'm not entirely convinced that using publicly available sources for training AI, that don't explicitly come with copyright terms that forbid it, is in any sense unethical, because I think it's entirely possible that it may end up being considered fair use.

Artists want to get paid, yes, but the whole "I didn't get to opt out" argument doesn't really sit well with me, because the degree of post-publication "control" being implicitly demanded in that argument doesn't seem to be very balanced. In fact you did get to opt out, software for example comes with extensive legal documents describing exactly how derivative works can use it, but books and photos generally don't, and maybe in the future they will have to, but the idea that retroactively interpreting the role of copyright on existing works to include special considerations for AI just doesn't seem that obvious. You might not like that you published work in a way that someone was able to use it in a way that you disapprove of, without explicitly forbidding said usage, but the fact is that you did. The rest is for the courts to sort out. And I think it could go either way.

protocolture · on June 21, 2024

I mean the way to fix it is to recognise that its absolutely cool to train AI on publicly available knowledge. Like its not a sin.

Maybe it comes from growing up with google hoovering up the internet, file sharing becoming common place and image boards making sharing copyrighted photos as reaction images the done thing. But I already feel like I own the sum total of human knowledge. I dont recognise sony or disney or the US government as valid inheritors or controllers of information. Its mine. And if there's a set of tools that can chew on that data to make new or interesting or collated or curated data then that just makes more data that I own. I own your art. I own your code. I own your stack overflow answers. I own every film and tv show and book from human history. And if that common ownership isn't the end goal then I dont know what is.

If free use doesnt currently cover these cases it should be extended to cover them.

idle_zealot · on June 21, 2024

You're not going to find many copyright abolitionists here. Apparently the idea that culture and information is a public good and not a product to be bottled and sold is so unpopular that it won't even be stomached in order to defend HN's favorite tech companies.

diputsmonro · on June 21, 2024

Information in the abstract, maybe. But it's entirely different to assume ownership of artwork other people produce, either for profit or personal expression. Maybe I would feel a little differently if all the AI art I see came from a place of earnest artistic expression, but everything is just some kind of scam or another. You aren't going to convince me that honest artists deserve to have their art stolen (often by name) and livelihoods destroyed so that crypto grifters can generate NFT assets a little easier and scummy companies can lay off their art departments. This is a tool that will be used to concentrate wealth and keep the little guy down, pure and simple.

And I'm definitely not swayed by the argument from this thread's OP of "I just want to steal everything I've ever seen and not have to do any hard work to create something myself".

CaptainFever · on June 21, 2024

I agree with you. People hate the powers of copyright until you give them some. Then they become even worse than Disney.

JohnFen · on June 21, 2024

> its absolutely cool to train AI on publicly available knowledge. Like its not a sin.

I think it is a sin to train LLMs on other people's works without their permission, but I separate this out from whether or not it's legally permissible. Not everything that is legal is acceptable.

In any case, right now anyway, the state of things is that if you put it on the open web, it's going to be used for training whether you approve of that or not. Which is why I and many others have abandoned the open web. It's impossible to publish there in a way that ensures that your work is not used in ways you disapprove of.

musicale · on June 21, 2024

AI's original sin isn't copyright violation. It's training machines on human-produced output.

mensetmanusman · on June 21, 2024

This is the middle class fighting the middle class; the billionaires that own both of types of organizations and the AI capital will inevitably use it to increase productivity and capture all of the excess monetary gains while the middle class shrinks.

redwoolf · on June 21, 2024

I'm in partial agreement with you. But I don't think the middle class will shrink, I think it will disappear.

The real threat to humanity from AI is not a Skynet or a Matrix like intelligence. The threat is that Artificial Intelligence WILL make labor irrelevant. When human capital has no value, we will disappear.

Don't kid yourself into thinking that those who owns the means of production will stop to help. If they were interested in that, they'd be doing it now. There's enough money and food to feed everyone on the planet. People still starve.

mitthrowaway2 · on June 21, 2024

Multiple threats can be real at the same time, and in this case are!

egypturnash · on June 21, 2024

What makes you so sure that “revolution” will not be the result of the middle class being destroyed by AI?

randcraw · on June 21, 2024

Revolution against what? Should we act NOW to take out all the plausibly psychotic plutocrats, well before they can arm themselves?

Or do we wait until they actually create an army of armed drones? Of course, by then it'd be too late.

Or do we choose a time frame that's in-between, trusting that the politicians and the courts and the police departments will pass and enforce laws to protect us before the plutos can amass power via fully automated factories -- laws that will never actually come to pass, since plutocrat-interests-and-their-minions already own the current system?

Nah. We frogs will boil LONG before revolution is a viable option.

radarsat1 · on June 21, 2024

> before the plutos can amass power via fully automated factories

Fun science fiction but this idea that the rich could develop an entire self sustaining infrastructure to protect themselves from any leverage from the working class is completely infeasible.

Let's day a full automatic drone factory was built. Factories don't make their own tools and general parts, they don't mold their own screws and bolts, no, they source those. Even if they could, they'd need raw materials. So they'd need completely automated mines. Those mines would be on land that they'd own, but they'd have to physically defend it. So the scenario you're proposing is: fully automated mining, with fully automated defense measures, delivering raw materials to some fully automated smelter (all travel routes automatically defended against sabotage), this all to feed a factory that would churn out defense robots. It's completely circular and silly.

nortonsdad · on June 21, 2024

They don't need to do any of that. They'll just use AI analysis combined with mass surveillance to identify and crush any organized dissent before it can form into anything close to a threat. A very small group of goons is all that is required.

randcraw · on June 21, 2024

All that may be true in the large open democracies, but it's already false in Russia, China, North Korea, Vietnam, and beyond... and based on the current propaganda-driven trends of mindless angst driving most democracies today, you could live most anywhere else and see a similar outcome in just a decade or two.

So in the next 20 years, which do you think will happen first: dictatorship by plutocrats elected to office, or by plutocrats who overthrow elections after they lose?

radarsat1 · on June 21, 2024

Ok but all those dictatorships are enforced by humans, not robots.

mensetmanusman · on June 21, 2024

The robots are here, helping man 10x their oppression productivity.

radarsat1 · on June 21, 2024

Sure but I was responding specifically to a scenario proposing that rich people would develop fully automated factories thereby cutting out their reliance on workers. Getting even halfway there still leaves plenty of room for revolution, I'm just responding to the 100% automated scenario as being unrealistic.

redwoolf · on June 21, 2024

I’m not sure of that. In fact, I hope that’s the outcome. I worry that capitalism has destroyed the revolutionary spirit.

Nasrudith · on June 21, 2024

It is a major logistics problem. Trust me, you would be WAY more mad if billionaires tried to feed everyone. You'd call it 'naked neocolonalism' or similar because it would require conquest of all nonconforming regions.

apsurd · on June 21, 2024

isn't the issue that capitalism is amoral? you're saying the capitalists (should be) helping by now if they wanted to. They don't do it because capitalism isn't directly incentivizing it?

growing economic power is supposed to help "all of us" in proportion to our input value.

All this to say, i agree, but it means we need to augment capitalism - not speak to help and morality?

redwoolf · on June 21, 2024

> isn't the issue that capitalism is amoral?

How can you call a system that disincentivizes collective care of one another "amoral"?

> but it means we need to augment capitalism

What a pithy phrase! You said yourself that capitalism doesn't incentivize those with to help those without. How do you augment capitalism to incentivize such behavior?

And to be clear, I'm not saying that the bourgeoisie should be helping. I'm just saying they don't. I think we should burn capitalism to the ground and build something better.

lxgr · on June 21, 2024

> How do you augment capitalism to incentivize such behavior?

By regulating it so that it serves the needs of the people living in it, and embedding those regulations in a system resilient against regulatory capture and nepotism.

As all economic systems, it’s ultimately only a means (efficient resource allocation) to an end (prosperity and growth for somebody).

Who that is, i.e. how the results of that growth are ultimately redistributed is not a question of economics but politics.

apsurd · on June 21, 2024

augment is the hopeful outcome. The alternatives seems quite bad to say the least. Maybe they lead to something good, something better. but there's a whole lotta bad in between for an unclear amount of time.

ianbutler · on June 21, 2024

> How can you call a system that disincentivizes collective care of one another "amoral"?

It doesn't disincentivize collective care. It incentivizes resource allocation most optimally, whatever optimality is within applied bounds. Which leads to my next point:

> What a pithy phrase! You said yourself that capitalism doesn't incentivize those with to help those without. How do you augment capitalism to incentivize such behavior?

By applying regulation and laws to the system, the bounds. It's not like any of this exists in a vacuum and can't be, again, augmented to fit our needs better.

> And to be clear, I'm not saying that the bourgeoisie should be helping. I'm just saying they don't. I think we should burn capitalism to the ground and build something better.

That's cute. What's your proposition for replacing Capitalism? What guarantees your new system is better? And I can assume you're agreeing to pay the incredibly bloody tab destabilizing our society will have? Since I assume from your post you basically haven't known anything other then peace and stability and historically speaking great times, you're basically a larper.

Ideas like this always strike me as incredibly naive. Why not work to improve a system that has shown promise instead of the repeating the long trail of critical failures from other systems.

redwoolf · on June 21, 2024

> It incentivizes resource allocation most optimally, whatever optimality is within applied bounds.

Exactly. And in the current bounds the optimal allocation is more into the control of fewer. Also, capitalism is a zero sum game, thus the disincentive to give what you've got.

> By applying regulation and laws to the system, the bounds.

Sure. But those regulations and laws are being repealed or revised to be toothless. And there are many places on earth where laws have no meaning.

> And I can assume you're agreeing to pay the incredibly bloody tab destabilizing our society will have?

And capitalism doesn't have an incredibly bloody tab? Take a global perspective to see the many useless wars in the 20th and 21st century that have been fought to extend the market for capital.

Or consider that our planet is fucked by global warming thanks to "optimal resource allocation" which takes no account for long-term consequences, but only short term gains. Do you think that when the oceans rise, the plants die, and the water is too acidic to drink society won't be destabilized?

lxgr · on June 21, 2024

> capitalism is a zero sum game, thus the disincentive to give what you've got.

Only in a zero growth environment. And in one, which economic or political system is not a zero sum game?

> consider that our planet is fucked by global warming thanks to "optimal resource allocation" which takes no account for long-term consequences

Pricing in long-term consequences and accounting for externalities is arguably not a contradiction to capitalism (and I think we should absolutely do much better there).

Ideally we’d even account for the effects of societal unrest due to massive wealth inequalities or areas becoming unlivable due to the climate, or just the immorality of poverty if nothing else.

But that’s a political decision (what do we value how much) in the end, not one of economic systems. Who gets to decide that is determined by the political system, not the economic one. And corruption of political power can exist in all systems.

redwoolf · on June 21, 2024

I’m not mad at what you’re saying. But politicians, in America at least, have no stomach for any such adjustments to the status quo.

numpad0 · on June 21, 2024

sustainable positive growth must lead to pure attention economy which is a form of uncivil meme nepotism

CaptainFever · on June 21, 2024

> Also, capitalism is a zero sum game, thus the disincentive to give what you've got.

Bullshit. If this is the case:

1. We will still be stuck doing subsistence farming.

2. No trades will happen.

3. The economy will not grow or shrink.

This is so obviously wrong. Please learn basic economics before trying to "burn it all down": https://www.investopedia.com/terms/z/zero-sumgame.asp

kapperchino · on June 21, 2024

Bruh stop with the middle class bs, there’s only working class and the owner class.

dannyobrien · on June 21, 2024

I know it's not helping, and off-topic, but I cannot resist noting that the ecological niche in Hacker News comment threads, previously occupied by doctrinaire libertarians, have now been almost entirely re-stocked with doctrinaire Marxists. Nature is healing!

redwoolf · on June 21, 2024

The middle class is the designation for the part of the population the owning class has duped into thinking they are temporarily embarrassed millionaires.

labster · on June 21, 2024

I’m not sure if you’ve seen housing prices lately but most of the middle class are technically millionaires now, just of the “land rich, cash poor” variety.

redwoolf · on June 21, 2024

Back that "most" up with facts. I'm not sure how much we can trust Zestimates.

vsuperpower2020 · on June 21, 2024

[flagged]

redwoolf · on June 21, 2024

You missed the point. There is no “middle class.” There is the proletariat and bourgeoisie.

mensetmanusman · on June 21, 2024

6-figure AI scientists are middle class workers increasing the roi on investments made by billionaires into gpu capital.

arh68 · on June 21, 2024

> respect signals like subscription paywalls, the robots.txt file, the HTML “noindex” keyword, terms of service, and other means by which copyright holders signal their intentions.

And if they disrespect those signals & terms, and lie, what then?

> the copyrighted content has been ingested, but it is detected during the output phase as part of an overall content management pipeline.

Yes, but how can it be detected reliably? Considering there is much to be gained in fooling us. (think parallel construction)

protocolture · on June 21, 2024

>Yes, but how can it be detected reliably? Considering there is much to be gained in fooling us. (think parallel construction)

I was fooling around with a fan constructed addon training module for novelai.

I had a blast, reconstructing a few different narratives and sort of melding them together was a lot of fun.

I let the tool name the characters. And the names it came up with were better than halfway decent. So I kept them.

Turns out that while novel ai makes some sort of best effort to remove copyrighted proper nouns, the additional training module had reinserted some. After it had selected the first, it immediately selected his brother for the next one.

If I hadnt gotten suspicious and googled the names in depth, I might have spread the story around. It wasnt ever destined to be published but I could see people falling into the same trap.

o11c · on June 21, 2024

The linked article cited Youtube's Content ID, so ... clearly reliability isn't expected.

c1sc0 · on June 21, 2024

I sometimes worry that qualms about copyright & ethics will make us lose the machine learning arms race with China. If « The unreasonable effectiveness of data » still holds true then we are in big trouble.

lithiumii · on June 21, 2024

Don't worry, in China they need to worry about a different set of "ethics". It's not just your LLM should never speak against Xi. They probably also need to include a big propaganda corpus so that the models are aligned with CCP's value, which is very likely to be bad for reasoning capabilities.

silver_silver · on June 21, 2024

> which is very likely to be bad for reasoning capabilities.

As opposed to our aligned models which refuse to answer questions with controversial answers?

snowwrestler · on June 21, 2024

I wouldn’t worry about it; in the U.S. the people with the qualms are not the people who are actually building the machine learning systems.

SV_BubbleTime · on June 21, 2024

Of course they aren’t the customers. That doesn’t stop them from having a disproportionately loud voice in the matter.

Once you really understand what ESG is, you will understand why every company gets a rainbow logo this month, why companies seem eager to push their actual customers aside, and why the constantly-offended are catered to instead of mocked.

Look at every AI press release. They all use “safety” over and over. Without that dogwhistle, they wouldn’t be in line for investment funding.

SV_BubbleTime · on June 21, 2024

No need to worry about it happening, it already has. In terms of diffusers alone:

Stable Diffusion 3 came out recently and immediately fell flat on its face. Terrible anatomy and basically unusable for human forms unless you use so many “unsafe” negative words that it’s “safety” training doesn’t destroy the result.

In the meantime… Chinese models Lumina, PixArt, and Hunyuan are all Chinese projects or heavily contributed by Chinese researchers and companies. These are rapidly gaining steam. They’re far less “safe” with far less lobotomization.

We are already losing the AI race in this realm exactly because of an over-reaction to “safety”.

zarzavat · on June 21, 2024

There’s three kinds of AI safety:

1. “AI is going to (help someone) take over the world”. We don’t need this yet but it’s better to think about it before you need it rather than afterwards.

2. AI is going to do genuinely bad things: generate CP, teach people to make dangerous chemicals.

3. AI is going to do things that people in California think are dangerous: generate a picture of a boob, generate picture of a group of people who are not a mixture of different ethnicities, or sexes, give an answer to a question that isn’t aligned with how people in California think.

China is not concerned with any of these, although it is concerned with its own version of point 3:

3b. Generate criticism of the CCP or anything that the CCP doesn’t like people to know about.

musicale · on June 21, 2024

It seems like the cat is out of the bag, and it may take a catastrophic disaster before any risks are taken seriously.

dorkwood · on June 21, 2024

You think OpenAI, Meta or Microsoft care about those things? They've already trained on all the world's copyrighted data without asking for permission. Any hand-wringing you see from them is performative.

exodust · on June 21, 2024

Why should AI need permission to perform the equivalent of "human looking" at content? We humans don't need permission, why should AI?

I could paint a picture of a Disney character after looking at nothing but public promotional material already out in public view, online or in print etc. The tools I use to paint with are capable of all the same colours and lines used in Disney characters. We don't blame the tools for that, but we seem to be blaming AI for being too good?

diputsmonro · on June 21, 2024

Selling a picture you made of a Disney character is IP infringement. If the work you create is substantially derivative and competes against the IP holders work (which is pretty arguable given the prevalence of prompts containing the names of specific artists/characters/companies), then it is infringement. The same would be true if I paid you to produce art/merch/etc. of a Disney character.

exodust · on June 23, 2024

I didn't mention selling the picture I made of a Disney character. I mentioned only drawing it.

Likewise, AI can draw the Disney character. You as prompt author may have no desire to sell or share the drawing. It's your karma if you choose to cheat, deceive or steal in any areas of life.

I am not a fan of AI-gen "art". But I object to pitchfork policy, banning and crippling AI from training on actual things in the world - IP protected or otherwise. I think AI should see everything we humans publish, so it knows us better than if we limited its learning.

diputsmonro · on June 25, 2024

> I didn't mention selling the picture I made of a Disney character. I mentioned only drawing it.

If you are paying for an AI service to produce art based on your prompts, then they are selling the art to you, rergardless of what you do with the image afterward. It's no different than going to an artist, describing an image you want, and paying them for what they produce. Generating images with propmts doesn't make you the artist, it makes you the commissioner. And if that commission involves a copyrighted character, then you have a legal issue.

> I think AI should see everything we humans publish, so it knows us better than if we limited its learning.

In an ideal world, I would agree. Unfortunately, the pressures of capitalism change the circumstances somewhat. As evidenced by the board shakeup, OpenAI doesn't care about lofty goals of bettering humanity, they want to make money. They built a plagiarism engine that will steal and sell the work of millions of artists, putting them out of business. In a better world where art production was not tied to their ability to feed themselves, then fine. But in this one, we're destroying the livlihood of millions to make some tech bro theives richer and flood the world with shitty "art". We have to consider the impact that technologies like this will have on society, given the way society is currently organized.

CaptainFever · on June 21, 2024

Is this a good thing? Taken to the logical conclusion, fandoms are illegal.

diputsmonro · on June 21, 2024

What if we flip the script - would it be cool if Disney scraped DeviantArt and created merchandise, movies, etc. based on the art and characters of small creators? Rules need to exist to protect the little guy.

Technically a lot of the merch people sell at conventions is infringing, but Disney, Nintendo, etc. recognize that pursuing action against them would ultimately be worse for their brand.

Generally speaking, whether a fandom existing around a creator's work is good is ultimately up to the creator. If the fandom economy chokes out the product of the artist, then the artist should have legal recourse (this is effectively the same scenario as counterfeits / knock offs). But if the creator's work is able to flourish alongside the fandom, then the creator may well decide that keeping the fandom engaged is worth more than being a legal prude. The protection should exist legally, and the creator should be allowed to exercise it as they deem necessary.

exodust · on June 23, 2024

> would it be cool if Disney scraped DeviantArt and created merchandise, movies, etc. based on the art and characters of small creators?

You've more than flipped the script. You've introduced a global corporation as the "prompt author", feeding off the work of small creators. It's not a fair equivalence because suddenly the "prompt author" has major distribution partners, reach, budget and branding to propel their deception around the world.

Where exactly are all these plagiarising AI-gen artists anyway? Professional concept art involves highly intentional detail, themes and thinking. AI-gen renderings are more like rolling a dice and saying "cool! check this out!" It's not the same game.

YeGoblynQueenne · on June 21, 2024

Are you saying I can make money from my Vader/Popeye fic?

dorkwood · on June 21, 2024

> Why should AI need permission to perform the equivalent of "human looking" at content?

Why should a car need permission to perform the equivalent of "human walking" down the sidewalk? A car converts fuel into kinetic energy and uses friction to propel itself forward, just like a human does.

The thing is, we don't refer to what cars do as "walking" because the difference is very visible to us. The difference is not so visible with ML algorithms, so people keep trying to compare them to humans, when they're not.

We need a new vocabulary to describe what these ML algorithms are doing with data.

exodust · on June 23, 2024

So... You're likening AI to cars that need registration to drive on roads; in other words AI needs permission to train on the world's content. I like this analogy.

You had me stumped for a minute. But whenever a car drives from A to B, the "damage" is done. Fuel spent, occupants transported, pedestrians given way, road surfaces slightly worn. When AI does its training thing, no damage has occurred. There is only potential damage later if someone decides to misuse what AI has learnt. I wonder if this makes me an optimist. I want AI to know more, because it will be better for us humans when we use it responsibly.

mistrial9 · on June 20, 2024

this article hits many nuanced points well.. and it steers towards a future I can support as a WEIRD (common acronym for Westerners with a higher education?).

too many detailed parts to respond to all of them in a brief reply.. I don't think these recommendations will apply to many parts of the non-West world..

here in the USA, this content makes a lot of sense to me

rickydroll · on June 21, 2024

This is a good article, and I like the nuance it conveys. My takeaway from the article is that it highlights the need to identify what sources of information go into making a response. If a response uses the result of training on 10,000 documents and costs $0.0001, then every document gets its 1/10,000 fraction of 1/100 of a penny.

When there is a payout scheme with information source traceability, I'm willing to bet that no-charge information will be used in preference to everything else. Alternatively, we could stop the copyright thuggery, treat training data as a worldwide Commons resource, and distribute the profits from LLM companies to sovereign funds.

hammock · on June 21, 2024

> When there is a payout scheme with information source traceability, I'm willing to bet that no-charge information will be used in preference to everything else.

This did not happen in music.

lmm · on June 21, 2024

It's gradually happening. Look at how many shops and restaurants today are playing cheap covers of recent hits rather than those recent hits themselves.

musicale · on June 21, 2024

Cover versions still require public performance licenses.

Restaurants and bars typically pay for licenses from organizations like ASCAP and BMI, who distribute royalties to songwriters (rather than artists or record companies). Those licenses can cover recorded music playback and/or live music (usually with higher fees.)

However, streaming and satellite radio stations have their own license regimes, which may include royalties to artists and to record companies (for streaming at least). So a shift to cover versions (perhaps owned by the satellite or streaming network itself) could lower payments by the network, so that could provide motivation for cover versions.

lmm · on June 21, 2024

Right. I think it's probably Spotify or equivalent somehow nudging them onto playlists of cover versions that are cheaper for them rather than the shops and restaurants explicitly making the choice. But it's a trend I've noticed that I thought was interesting, and suggests that cost is having an effect at some point in the chain.

muzani · on June 21, 2024

Is this really the case? I'm a big fan of covers, and from what I've heard, nearly all of them share revenue with the original copyright holder. Except Hotel California - nobody is allowed to cover anything by the Eagles.

pavon · on June 21, 2024

The Eagles have no say in the matter. Anyone in the US can legally record or perform cover songs as long as they pay government-defined statutory licensing fees.

rickydroll · on June 21, 2024

> This did not happen in music.

Copyright thuggery is the reason why

apantel · on June 21, 2024

One point that stood out to me that I agree with is this:

> It is certainly worth knowing what content has been ingested. Mandating transparency about the content and source of training datasets—the generative AI supply chain—would go a long way towards encouraging frank discussions between disputing parties. But focusing on examples of inadvertent resemblances to the training data misses the point.

I think that’s a good idea because that at least opens the possibility for affected content creators to knock on the AI company’s door and demand parley.

m3kw9 · on June 21, 2024

What happened to the generating training data hype train?

numpad0 · on June 21, 2024

These[0][1] tweets showed up to my timeline recently. I don't know it's just an anti-AI luddite propaganda or not, and I do find people who do definitely not fit below caricaturization, but it seem to resonate with my sentiments and geostationary orbit gist of the matter:

  David Holz mentioned in today’s Midjourney office hours that they have more customers over 45 than under 18. He said you’re more likely to run into a 65-year-old woman than a teenager in MidJourney’s community.I think that’s a really interesting point about generative AI. It’s a bunch of old people who are telling you that it’s some mind-blowing invention. Young people are largely uninterested. They think it’s boomer art, not the guaranteed technology of the future. It’s only cool to olds.

  The internet was never like that—old people haven’t traditionally driven the culture. Tiktok and YouTube are often associated with younger people. Popular music and movies too. AI seems to only appeal to people who are past their prime. It lacks the It factor for people under 40.

  0: https://www.threads.net/@thebrianpenny/post/C8aTGjUycs-/
  1: https://twitter.com/chiefluddite/status/1803704263148970255

diputsmonro · on June 21, 2024

> Meanwhile, the AI model developers, who have taken in massive amounts of capital, need to find a business model that will repay all that investment.

...

> The extreme case is these companies are no longer allowed to use copyrighted material in building these chatbots. And that means they have to start from scratch. They have to rebuild everything they’ve built. So this is something that not only imperils what they have today, it imperils what they want to build in the future.

...

> "the only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data."

The simple and correct answer then is that they shouldn't exist. I don't care how much money they spent building a plagiarism machine, the expense and desire to build it doesn't excuse it being a plagiarism machine. Rich VCs don't get to ignore the rules just because they think they found a neat hack to make them billionaires. And after the shenanigans with the OpenAI board, the altruistic "for the betterment of humanity" argument has been revealed to be a thin sham.

exe34 · on June 21, 2024

the confetti is out of the cannon now.

renewiltord · on June 21, 2024

> When someone reads a book, watches a video, or attends a live training, the copyright holder gets paid

Bullshit. I have given many a book to a friend and they have passed it on. If this statement were true, then O(n) payments would have been made.