
I feel the opposite way about The Eye. The data hosted by The Eye isn't theirs to give away. Their approach feels disingenuous.



I struggled with similar feelings. But what convinced me is that they take great care to remain compliant with the DMCA. Pirates don't.

The difference with The Eye is that they thoroughly vet the legitimacy of each DMCA claim. The claimant is required to show proof that they are the legal copyright holder. And The Eye seems willing to call bluffs: on more than one occasion they have pushed back against bogus DMCAs where most companies would simply have given up.

Ultimately, their actions are legal. And for us in the AI community, it was something like the hand of God reaching down to bless us with a guardian angel. The reproducibility situation is getting worse each week, and much of that is due to the fact that realistic datasets can’t be distributed without fear of reprisals.

That said, I respect your feelings on the matter too. I think it’s equally valid to feel uncomfortable. I just take solace in the fact that it’s legal.


Copyright is an issue because we have recently seen that large language models can reproduce training data verbatim.
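
A minimal sketch of the kind of check that implies (the function names and the 8-gram threshold are just illustrative assumptions, not anything from a paper): flag n-grams in generated text that also appear verbatim in the training corpus.

    def ngrams(text, n=8):
        toks = text.split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def verbatim_overlap(generated, corpus, n=8):
        # any shared n-gram is a span the model reproduced word for word
        return ngrams(generated, n) & ngrams(corpus, n)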

But the other issue is bias: some AI ethics researchers object to using raw, unfiltered text because of the biases it carries. For example, the Reddit dataset is considered a no-go. I have no idea what to replace raw text with, whether we could even get unbiased text at this scale, or what bias-detection rules to apply, because AI ethics is still a nascent field, though recently it has gotten dead serious.


Whenever someone uses the term "bias" in ML in anything other than its statistical sense (E[\hat y - y]), it's helpful to mentally replace it with "opinions I disagree with" and see if the argument still makes sense.
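
For concreteness, a tiny numpy sketch of bias in that statistical sense (all values invented): a ridge-shrunk slope systematically under-predicts at large x, so E[\hat y - y] is nonzero there.

    import numpy as np

    rng = np.random.default_rng(0)
    n, beta, lam, x_test = 50, 2.0, 50.0, 3.0
    gaps = []
    for _ in range(2000):
        x = rng.normal(size=n)
        y = beta * x + rng.normal(size=n)
        b = (x @ y) / (x @ x + lam)   # ridge slope (no intercept), shrunk toward 0
        gaps.append((b - beta) * x_test)
    print(np.mean(gaps))              # clearly negative: E[y_hat - y] != 0 at x_test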


Well, if the goal is to produce an example of how one might train a language model, I think it's fine to ignore algorithmic bias issues.

But when people are talking about making e.g. AI chatbots to help teach in classes, or customer service bots, I think it's perfectly reasonable to imagine that research effort is useful to ensure the bot isn't acting like Tay did...

(And more broadly, if you're trying to use an ML model to e.g. screen candidates for hiring, you better be damn sure it's not causing discriminatory hiring practices.)
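
To make the hiring point concrete, here's a toy sketch (entirely made-up data) of the sort of audit people mean: compare the model's selection rates across groups against the common four-fifths rule of thumb.

    import pandas as pd

    decisions = pd.DataFrame({
        "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
        "hired": [1, 1, 1, 0, 1, 0, 0, 0],   # hypothetical model screening decisions
    })
    rates = decisions.groupby("group")["hired"].mean()
    print(rates.min() / rates.max())  # 0.33 here; below 0.8 flags possible disparate impact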


I think it's fine to build NLP models with any desired property you like, including "not leaking vulgar opinions that offend the courtly manners of your society." Makes plenty of sense.

But I do wish people would be more frank and self-aware about these purposes, rather than fig-leafing them as "ethics" or "fairness". By the time you're deliberately omitting the US Congressional Record on grounds of problematicity, it's worth asking yourself whether you're making your bot polite or performing damnatio memoriae.


Bias in the statistical sense is usually E[\hat beta - beta]. By which I mean: there's a specific aspect of this thing I'm trying to estimate. The whole field of causal inference rests on the fact that if you do things naively, you mix your signals. Linear regression can give you biased or unbiased coefficients depending on the setting. Sometimes you need something like IV, because just plugging in your data will tell you that ambulances are bad: riding in one indicates the patient is more likely to die, even after conditioning on everything else catalogued.

It's not opinions I disagree with; it's aspects and behavior I don't want, which is the statistical sense.
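
A quick simulation of the ambulance example (all numbers invented): severity drives both ambulance use and death, so regressing on ambulance alone flips the sign of its true effect.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    severity = rng.normal(size=n)                          # unobserved confounder
    ambulance = (severity + rng.normal(size=n) > 0) * 1.0  # sicker patients ride more
    death = 2.0 * severity - 1.0 * ambulance + rng.normal(size=n)

    # naive OLS of death on ambulance alone: severity is omitted,
    # so the ambulance coefficient absorbs it and comes out positive (~ +1.3)
    X = np.column_stack([np.ones(n), ambulance])
    print(np.linalg.lstsq(X, death, rcond=None)[0][1])     # true causal effect is -1.0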


Bias in prediction, rather than parameter estimation, is a perfectly well established sense of the term. In particular, people doing language modeling are practically never concerned with identifiability, because you can't pick out one weight out of a trillion parameter model and say what it ought to be in the limit of infinite data.


But when people use the term bias in NLP, that's what they're talking about. They don't want an aspect of the model to do something it ought not do. It's a case of omitted variable bias causing things like the word-analogy issues you hear about, not an issue of bias in predicting the masked word.
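
A toy illustration of those analogy issues, with invented 2-d vectors in which a gender direction has leaked into the occupation words:

    import numpy as np

    vecs = {  # made-up embeddings, just to show the arithmetic
        "man":      np.array([ 1.0, 0.2]),
        "woman":    np.array([-1.0, 0.2]),
        "doctor":   np.array([ 0.8, 0.9]),
        "nurse":    np.array([-0.8, 0.9]),
        "hospital": np.array([ 0.0, 1.0]),
    }

    def analogy(a, b, c):
        # "a is to b as c is to ?": nearest word to b - a + c
        target = vecs[b] - vecs[a] + vecs[c]
        return max((w for w in vecs if w not in (a, b, c)),
                   key=lambda w: vecs[w] @ target / np.linalg.norm(vecs[w]))

    print(analogy("man", "doctor", "woman"))  # "nurse": the gender axis decides it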


If you create a model that outputs text, and that output contains opinions you disagree with, people are still going to act as if you agree, because it was you who created it.

So often "we can't use Reddit because lots of stuff on Reddit is offensive to somebody" does make sense.


> And for us in the AI community, it was something like the hand of God reaching down to bless us with a guardian angel.

I'm disagreeing as a member of the community. I think back to the original books corpus. It wasn't Stephen King, but it was books from authors who gave away their content more permissively than King does. It sucked how unreproducible it was, but that lack of reproducibility came from compliance with the authors' wishes. If you compare it with the recent books3, the real difference is that its creator just ignored the authors.

> Ultimately, their actions are legal.

I worked on a project recently that dealt with the question of whether willingness to honor takedowns was sufficient, and our legal team warned us that it isn't. We threw away lots of work, which was a hard choice I'm proud of. The tl;dr was analogous to this: if your site hosts DVD rips of all the big studios' movies, and Disney comes along with a DMCA request that you comply with, you aren't protected in continuing to host everybody else's movies just because they haven't sent a DMCA request yet. That only protects you when most of what you host is material the uploader actually had the authority to upload in the first place. (It was much more legalese, of course, and I don't understand it perfectly, but the gist is that DMCA compliance is necessary but not sufficient.) I.e., The Pirate Bay or old-school Napster can't just start honoring DMCA requests and be in the clear.

Putting the onus on creators to request takedowns is telling them to play whack-a-mole. As someone with a pipe dream of leaving ML and becoming a novelist, our community's popular approach to IP embarrasses me. "Reproducibility is hard because the owners of the data don't want us to pass it around" doesn't mean we should find a way to ignore the owners. Disregarding data owners' wishes in exchange for better models is how the world got into this privacy hell.

So many of us act as if, because copying something is essentially free, the thing itself should be. It's embarrassing because our own output takes the same form, so we of all people should recognize that the cost of copying is the least important factor. Or we just invoke fair use, as if it applies to everything under the sun. The Eye does more than most, but their approach still makes me feel embarrassed that my own community applauds them.


It's always telling when you see a lot of down votes and no counterarguments.


Edit (too late to edit the above): Reading through The Pile paper, they define public data as follows:

> Public data is data which is freely and readily available on the internet. This primarily excludes data which is pay-walled (regardless of how easy that paywall is to bypass) and data which cannot be easily obtained but can be obtained, e.g. through a torrent or on the dark web.

This should disqualify books3, but they use it to justify books3.

They need to extend that definition transitively. Maybe that's my frustration. If I collect data from torrents and make it more freely and readily available, then it meets their definition of public data, which is basically what books3 is. If they had included it themselves, it wouldn't meet their definition; but because someone else redistributed it first, it's somehow more okay for them to redistribute?


They've also scraped Hacker News posts. Since I posted blog links to Hacker News, does that mean they stole all of my blog posts? That represents three years of work and the chapters of 3 books that I intend to publish. They just took it and will start delivering it to their users to help them write more interesting content? Not okay.


If I Google “bibliotik download” the first page of results has a link to download. Calling that “not easy to obtain” is absurd.

And yes, your last sentence is correct. That’s exactly the position being taken.


> If I Google “bibliotik download” the first page of results has a link to download. Calling that “not easy to obtain” is absurd.

You're right. Then why do you exclude so much else that meets the same criteria?

If I Google “[insert dataset here] download” the first page of results has a link to download. Calling that “not easy to obtain” is absurd.

In your definitions, you exclude vast swaths of things meeting that criterion in a seemingly arbitrary manner, which is why it's weird.



