Hacker News

If you scrape my hobby website about photography, scuba diving, or let's say baking or gardening, and it improves your model by, say, a delta of 0.00000000001, then shouldn't I get some free credits to use that model, or a proportionate share of the revenue stream?

EDIT: scuba diving NOT scooba diving




Counterpoint (not just to be annoying — I think you pose a very interesting unanswered question):

If I read your hobby website about photography and use it to take 1% better pictures, do I owe you 1% of what my clients pay me?

I think that probably most people would say no, assuming you could even determine that 1% in a way that both parties agreed was fair. I think generally, we have an understanding that some stuff is put out into the world for other humans to learn from and use to make themselves better, and that they don’t owe the original authors anything other than the price of admission.

I guess it comes down to this: do we think that training a model is:

- like storing and later reproducing a version of some collected data, or

- like learning from collected data, and synthesizing new info?

Is there even a meaningful distinction, for a computer?

(Is there even a meaningful distinction for a human…?)


This is a very thought-provoking point and it thoroughly stimulated me to think deeper. The purpose of my website is threefold: to document my own knowledge, maybe some vanity, and the urge to give back something to help "someone" make a better living or similar.

Things get interesting at corporate scale. There are fat VC funds, executives, boards of directors and what not, making far more money, far more comfortably, than an individual trying to get better at their craft to put food on the table. And on top of that, you don't give me access to the product that was refined on my input.

It is like someone learning photography from my website, later taking a real masterpiece of a shot, but charging me each time I want to view the photo in their studio.

There are no easy answers, I concur.

Thanks for your comment though, really. :)


Yes, it is interesting. To me, the important thing is that our labour is exploited in many more (and many more malicious) ways than making an LLM 0.000001% better, maybe (or maybe it makes it worse!). Therefore, the problem isn't the AI, it is this giant financial machine which sucks value out of all who actually produce it, no matter what tools it uses to do so.


I doubt the number of content creators will increase or even stay constant if they know that only AI models will continue "reading" them.

> do I owe you 1% of what my clients pay me?

I would still derive some immaterial gain or satisfaction from you reading my website specifically and using what you learnt to improve yourself, as I expect most people would, so it's still a give-and-take relationship. LLMs sever that link.

It is doubtful many people will be as willing to continue "putting stuff out into the world" if they know that they are only contributing to some sort of (arguably semi-dystopian) hive-mind.

IMHO whether what they are doing is justifiable from a legal perspective is tangential and not that relevant if we're talking about free/non-commercial content.


> LLMs sever that link

Do they though? I mean, do you personally have a link to the people that are consuming the content you post publicly?

I find all the vitriol around LLMs being trained on public data to be a bit weird. If you don't want that data being used then don't publish it for the world to see? Why get mad when you are the one freely publishing the data in the first place? That's like posting your content on a bulletin board in the dorm common room and telling the trust-fund kids they can't read it because they are rich and you don't want them learning anything from you that might make them richer. Maybe a bad analogy, but I feel like it's a fair approximation of the vitriol I see.


It should be treated as learning. If it truly stores and reproduces a photo (to some high accuracy), then there are already laws in place that handle this. Your client using the output may infringe on the photographer's rights, which may fall back on you depending on your contract.

If I watch a YouTube video, my browser is also in a way scraping YouTube and storing a (temporary) copy of the video. Does it make sense to protect the owner's rights at this point? Absolutely not. Instead we wait to see if I share that downloaded video or content from it again, or somehow reuse it in my own products. Only then does the law step in.


The distinction is the scale at which OpenAI can make a profit off of your work. Now this might sound trivial, but the scale of possible fraud has been the biggest argument against online elections.


Interesting point though I'd go with another analogy.

You can go to a library to borrow a book, but you can't go to the library and copy all the books for your own use.


I used to go to the library, find books with the relevant chapters related to what I wanted to learn, and the librarian would photo copy all the pages I wanted to take home. So I guess technically you could copy all the books for your own use.

It's just impractical to photocopy every page of every book in a library.


When I was in libraries you could photocopy a percentage of a book (15% maybe?), although I doubt it was enforced. One could do many trips, but it is impractical, as you say.


Not really sure that this analogy applies, because I could definitely photocopy as many books from the library as I physically can. No one is going to stop me.


As far as I'm aware, photocopying an entire book does in fact violate copyright law and librarians will refuse to help you do it: https://guides.cuny.edu/cunyfairuse/librarians


Well it's not so much about the physical act of doing it, it's the trying to convince the world it's for your own private use and not for commercial gain.

Otherwise, intellectual property laws can perhaps apply.

It'd be a hard push to claim that wholesale copying of others' works is fair use.


I would put it a little differently.

You can actually copy the whole book, but the thing is you can't publish it as your own book after you've copied it, because obviously it is not your work.


> You can go to a library to borrow a book, but you can't go to the library and copy all the books for your own use.

Why? What's stopping me from doing that? The only limitation is time.


If the library owned effectively infinite copies of each book, why wouldn't they let you borrow one copy of each book?


The online library known as archive.org tried exactly this. They got sued, to no one’s surprise.


Because authors and publishers wouldn't be very excited about that and would lobby governments to limit that (and I 100% believe they would be right to do that).


You can do that. Google literally already did that.


?? You can. It would take a long time, but you could.


I think we need to put this argument in terms of consent and actual harms caused. Human artists are generally down for other human artists to learn from their art and use their stuff as a reference for the purpose of learning, because the next artist generally will have their own style from their own quirks in muscle memory, skill, experience, etc. That contributes meaningfully to Art and keeps the field alive by allowing new artists to enter the field.

AI training is basically only extractive and has the potential to severely disrupt the actual field that made the AI systems possible at all. It's a much more mechanical process than the human interaction of studying a master. It doesn't develop any human skills.

Even if the processes were the same (and I don't think they are, as someone who has actually done computational psychology research), I would still think the AI companies are doing something they know is harmful to actual creative people that generate real value.


What if very rich people came to your small free-entry photo studio to look at your pictures, and, perhaps because they have very fast jets, also went to every other photo studio in the world to look at every other photographer's pictures? Knowing this, would you still let them in for free?

I believe no. Most people would make a distinction between “normal” and “rich”. They would give normal people free access, but the rich should pay for it.

It’s like a billionaire asking for a free hot dog. It’s like “come on, you can easily pay $100, which could even sponsor it for the next 100 people”.

Here it’s not the AI itself that’s exploiting you. It’s the rich people that make the AI that get even richer - partly thanks to your free work.


I don't think we really even need to dive that deep into the philosophical aspect of this. I think that it's fine to simply treat humans and machines differently, the same way we decided that animals cannot hold copyright for a work.

The reason copyright law exists in the first place is due to the difference of scale between copying books by hand and using a machine to do it, so I think "it's different because a machine is doing it" is a completely rational stance to take.


I think there is a clear distinction that can be made. With humans, you can't determine if or how that information will be utilized. With any machine, you can. It's practically a copy. If it's only storing derivative information, if there is fuzziness, that's intended.

Far in the future, if ever, when we have biological-grade artificial beings which you can't program, control and limit in the classical software development sense, this could be rethought.

Until then, we don't need to humanize machines.


I know very few altruist humans. Whenever someone puts up some content online I believe there is always some motive from the author to benefit themselves even if it's subconscious. Perhaps through ad revenue or exposure from their blog/OSS project or just the dopamine of fake internet points from answering questions on forums. A human may particularly like your content and keep coming back to it or spread it with attribution.

But you don't get any of that from an LLM.


An AI is not a "you". It's a piece of software that steals data, rinses it, and monetises it. There is no human-like learning.


Even if training a model turns out to be similar to human learning, I don't think it necessarily follows that it should be treated the same, legally or morally. There's nothing wrong with human laws or morals that enshrine human behavior, like the human way of learning, as special and distinct from machine learning.


Better analogy would be: I read your hobby website and start a photography section in my Q&A website based on what I've learned from your site. That leads to a 1% increase in my revenue.


I think that's where references are important. We do more for the world by giving credit; I think it should be the same for computers?


You don't have to publish anything on the internet. And when you do, you may limit the allowed audience to just the group of your friends etc. Why publish anything if you worry that someone may consume it?


ChatGPT is not "someone", it's a black box that will ingest everything at its disposal and can't tell you where it gets the information from.

The moral thing to do would be to use opt-in training data.


I'm sure this technology is going to dissuade some people from publishing. Why bother if it is going to be regurgitated to everyone and their dog for $10 a month.


Why? I wouldn't pay you for marginally improving my baking skills either.

It is an interesting question. I would have no qualms paying for a textbook or university course for curated learning (worth noting OpenAI has paid datasets too), but paying for (or being paid for) relatively diffuse and low quality content through hobby blogs seems at odds with my expectations as an individual, and as a society we were never (en masse) concerned about things like Google's search excerpt answers...


But one of my unstated goals is to improve "YOUR" baking skills. That pays me off in satisfaction nevertheless. You might refer me somewhere later on, so that pays off, or I might have some ads that you might see, so that's there.

With a walled-garden, proprietary, paywalled model, what I wrote ends up as some constituent of giant arrays of floating-point numbers which I must pay to use.


Because perfect information transfer isn’t usually possible by a human reading a book or website, whereas computer systems can usually do that.

If humans could perfectly remember information, I’m sure copyright would be very different.


But a model learning from data and reproducing it in some fashion is absolutely not perfect information transfer.


But humans can memorize information; it's always a possibility for any work. Meanwhile, LLMs don't record things the way computer systems normally do.


You might not pay, but ad revenue might.


Every response so far is no, i.e. the hobby website doesn't merit any compensation.

A contrarian take to support the original commenter: if the site owner had ads, I probably got him or her some increment in site visits and helped in some small way with monetization, site ranking, and boosting his or her public persona and credibility.

When GPT bot visits, none of that happens. Much worse - people who might have visited the hobby site and contributed to traffic and ad revenue will now start getting their answers from the OpenAI chatbot and never visit this hobby site.

That's exploitation and I think that's what most of the responses on this thread miss.


I like your pov and I pretty much think the same. Copy for learning is different from copy and publish as your own.


The responses are nihilistic libertarian, as is typical here.

When a private company takes the sum of human knowledge without permission, attribution or payment and then monetizes it via the back door whilst cutting off any connection between the intended consumer and publisher, then we're dealing with a system I'd describe as criminal. It cannot be morally defended as "fair" in any major economic or political system.

The fact that they call it "Open" AI shows the level of trolling involved.


The LinkedIn case has already established the legality of scraping, so this argument falls flat too.


I don't think it's so much about legality as about maintaining incentives (both financial and immaterial) for people to publish high-quality content that's available publicly.


> scooba diving

This is a bit pedantic but the term is "scuba diving". Scuba is an acronym that's short for "self contained underwater breathing apparatus". It doesn't work if you don't spell it right.


ChatGPT bot detected


If I learn something from your StackOverflow answers, do you expect me to share a percentage of my future salary with you?


SO answers are explicitly licensed under CC-BY-SA 2.5/3/4, depending on the time it was posted https://stackoverflow.com/help/licensing. So no.


But are you human or not? Because rights and laws that apply to humans do not necessarily apply to objects, and vice versa. I don't expect a building permit from you when you stand on a piece of land. LLMs aren't legal entities in the formal sense, are they?


It is not about learning, it is about publishing.


Can you share your knowledge with millions at once?

If so, then pay.


So there's an infinite pyramid of "who learned what from who", and payment flows upwards along the hierarchy, all the way back to people who are long dead, and then down to their descendants who presumably inherited their "knowledge rights"?

You can't be serious. Thank god our world doesn't work like that.


It's not about who learned what from whom, it's about the superstar economy. If you serve all customers and leave nothing for the rest, it will be a problem.

Why do you think writers and actors included AI among the reasons for their strike?


> Why do you think writers and actors included AI among the reasons for their strike?

Because they are about to become obsolete, and they believe that screaming as loudly as they can is going to stop that.

Their chances of success are roughly the same as if they were protesting against the law of gravity.


The fun part about being a strong believer of AI and actually understanding its capacities is being able to tell when people are completely blinded by hype.

AI will not make writers “obsolete”; that is utterly absurd. Would you say reality TV made TV writers obsolete? No? Oh well.

You get what you pay for. That includes what you pay for as a producer…


> AI will not make writers “obsolete”, that is utterly absurd.

Of course. And those so-called "computers" won't make human calculators obsolete. After all, they are as large as an entire room, and by the time they are ready to receive input, a human with his slide rule has already computed three and a half entire logarithms!

Human creative professions have 5-10 years left, if they are very lucky.


> Human creative professions have 5-10 years left, if they are very lucky.

So in that sense do developers have ~2 years left? Code is much more rigid than acting or creative writing, and AI seems to be getting there first. I mean, if the all-powerful AI can make modern movies, then clearly it can handle writing all code, right?


Call me crazy, but the AI-generated Seinfeld brought me more entertainment in the last 6 months than anything Netflix has produced in the last year.

I think they're _very_ worried and rightfully so. I assume it would be very difficult to cancel an AI.


Maybe you can elaborate more on why you are more entertained, then some producers from Netflix can take note and improve.


Not as obsolete as their bosses/owners have long been.

We make our own rules. We decide what to allow and what to value. If technology changes something, it's because we let it.


> Can you share your knowledge with millions at once?

Yes, I might run a course or something. You are still not entitled to payment.


I mean yes? Answer a bunch of stackoverflow questions and you'll hit that.


Isn't that what everyone who writes on the internet does with every tweet, toot, blog post, vlog, podcast, short, reel, and comment?


Since that delta is clearly a transformative use of your photo -- as in, the output doesn't even remotely resemble the input -- you don't have any legal claim to it, no. I'm not sure what the plaintiffs arguing otherwise are smoking if they think they can argue it isn't transformative.


I don't think it's that simple. In order to have a claim to fair use, you would have to argue that the derivative work doesn't negatively affect the market for the original. When Google got sued for scanning copyrighted works for Google Books [1], they could claim fair use since they were only letting people see small excerpts from the books.

If you can train your bot on my blog post about scuba diving without my permission and then people can ask your bot for scuba diving advice instead of reading my blog, that doesn't seem very fair.

[1]: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....


> I don’t think it’s that simple. In order to have a claim to fair use, you would have to argue that the derivative work doesn’t negatively affect the market for the original.

No, you don’t.

That’s a factor weighing in favor of fair use, but the fair use factors are not defined in such a way that that is a necessary factor.


This doesn't appear consistent with other visitors to your website. If a cafe owner uses info on your site to improve their baking, should they also be required to share their revenue with you?


If the cafe is a multi-billion-dollar corporation that can only exist because it can leech off content created by millions of other people without providing anything at all to them in return (and I'm not necessarily talking about financial compensation), then yeah, maybe you should.


So Starbucks? Should they be sharing all their revenue with whoever invented all those Italian coffee drinks?


Starbucks' business model is not entirely (or at all) reliant on the availability of new coffee drink recipes which can only be provided by third parties. So no, I wouldn't say so.


Are you considering ad revenue?


That already puts the website in the sleazy category. "I mixed my helpful information with mind poison" isn't a strong position to argue fair play from.


You missed the more general point that if folks do have a way of making revenue from their content then stealing their content would have a negative impact. Maybe someone has amazing content and offers classes. You might be able to think of other possibilities.


I see no point in engaging with this argument. ChatGPT is not a human. I, nor anyone else, should have to explain to you why that makes all the difference here.


If you don't want to engage in the argument, that's on you. I don't think ChatGPT not being a human makes any difference and I think the onus is on you to explain why it should.


No the onus is not on the person thinking laws written for humans apply only to humans. That doesn’t make any sense.


Now you're shifting the goal posts. Please re-read the comments/replies up to this point and you'll see no mention of laws anywhere. That's not what the discussion is about. It's about whether AI consumers of publicly accessible content should be required to pay for that content when human consumers should not.


> shouldn't I get some free credits to use that model or proportionate share in the revenue stream

But you do get paid in kind: you "gave" information for the AI to train on, and the AI gives you information back, contextualised to your needs. Sometimes those 1000 tokens are worth much more than $0.06.

You still need to be able to pay for inference costs, it's crowded and expensive on GPUs nowadays.


Since it "gives" the same information to everyone, there aren't really that many incentives for you to allow an LLM to use your content. "Tragedy of the commons" and all that stuff...


If your website appears on the search results of Google and they show ads next to it, aren't you entitled to that revenue too?


Google allows me to search their index limitlessly, which lets me find other pages too, and in turn they sell my attention, so it is a somewhat fair proposition, in contrast to a walled-garden AI model such as GPT-4 that charges per token and includes my content as well.


Can you not use ChatGPT as well?

I think you'll find if you do try to push Google Search too far, it's not quite "limitless" either.


GPT-4 isn't free. On an individual, human scale, Google Search is virtually limitless. I've sometimes been presented with a Captcha when frantically searching for something, but that too is in the distant past, like the late 2000s.

Hasn't happened in a long time.


quick back-of-envelope/googling:

openai is worth $29,000,000.00 you contributed 0.00000000001

punches numbers in calculator

thus the value of your free credits is about 0.03 cents. minus any accounting fees.
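For anyone who wants to check the calculator's work, here is the same back-of-envelope sum as a quick script. Both input figures are the (hypothetical) ones from the comment above; the valuation is likely short a few zeros, so the script also shows what a ~$29B figure would give.

```python
# Back-of-envelope: one contributor's proportionate share of a model's value,
# using the figures quoted in the comment above (both hypothetical).
valuation_usd = 29_000_000.00        # figure as written in the comment
contribution_delta = 0.00000000001   # i.e. 1e-11, from the original question

share_usd = valuation_usd * contribution_delta
share_cents = share_usd * 100

print(f"share: ${share_usd:.6f} ({share_cents:.4f} cents)")

# With a ~$29B valuation instead, the same delta is worth roughly a quarter:
print(f"at $29B: ${29e9 * contribution_delta:.2f}")
```

Either way, the payout rounds to pocket lint once accounting fees are subtracted.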


>openai is worth $29,000,000.00

You might have missed a few zeros.


He doesn't believe they have a moat, seemingly!


Much of that is Azure credits and not real money.


I find it very strange to think that you are entitled to anything in return when something views and processes content you have publicly shared.


I think what you are doing is great. I would say the current way ChatGPT works isn't ideal: harvesting the data without giving credit. The tech is great, but I believe there is a way for all to win.


What if your website contains incorrect information that makes their model worse?


Ahh, dark-hat AI patterns, almost like dark-hat SEO techniques.


Depends on what the courts say. We'll have to see.


OpenAI would love that kinda regulation, it would basically kill free models.


Gotta pull the ladder up after you if you really want to maximise profits.



