The Pile: An 800GB Dataset of Diverse Text for Language Modeling

sillysaurusx · on Jan 2, 2021

I'm sad they dropped Literotica from the dataset. It's available in the old "preliminary components" archive, if anyone wants to train a Literotica-Enron-Email GPT:

https://the-eye.eu/public/AI/pile_preliminary_components/

I think The Pile is probably one of the most important AI projects of the last year or so, at least for lone wolf researchers like me. Gathering training data at scale can be excruciatingly difficult. (Perhaps someone will do something similar for GAN training one day: a large dataset for image modeling would help a lot.)

By the way, consider contributing to The Eye: https://the-eye.eu/

Without them, I’m not sure any of us would have been able to host the datasets we gathered — or organized torrent seeds, or fielded DMCA complaints, etc. So I feel The Eye deserves an explicit shoutout for being such an asset to the AI community, along with TFRC and Eleuther.

6gvONxR4sf7o · on Jan 2, 2021

I feel the opposite way about the eye. The data hosted by the eye isn’t theirs to give away. Their approach feels disingenuous.

sillysaurusx · on Jan 2, 2021

I struggled with similar feelings. But, what convinced me, is that they take great care to remain compliant with DMCA. Pirates don’t.

The difference with The Eye is that they thoroughly vet the legitimacy of each DMCA claim. The claimant is required to show proof that they are the legal copyright holder. And The Eye seems willing to call bluffs: on more than one occasion they have dealt with bogus DMCAs in ways where normal companies would simply give up.

Ultimately, their actions are legal. And for us in the AI community, it was something like the hand of God reaching down to bless us with a guardian angel. The reproducibility situation is getting worse each week, and much of that is due to the fact that realistic datasets can’t be distributed without fear of reprisals.

That said, I respect your feelings on the matter too. I think it’s equally valid to feel uncomfortable. I just take solace in the fact that it’s legal.

visarga · on Jan 2, 2021

Copyright is an issue because we have seen recently that large language models could reproduce training data verbatim.

But the other issue is bias - there are some AI ethics people who object to using raw, unfiltered text for its biases. For example the reddit dataset is considered no-go. I have no idea what to replace raw text with, or even if we could get unbiased text of this size, and what kind of bias detection rules to apply, because AI ethics is still a nascent field but recently it has gotten dead serious.

lambdaphagy · on Jan 2, 2021

Whenever someone uses the term "bias" in ML in anything other than its statistical sense (E[\hat y - y]), it's helpful to mentally replace it with "opinions I disagree with" and see if the argument still makes sense.

remexre · on Jan 2, 2021

Well, if the goal is to produce an example of how one might train a language model, I think it's fine to ignore algorithmic bias issues.

But when people are talking about making e.g. AI chatbots to help teach in classes, or customer service bots, I think it's perfectly reasonable to imagine that research effort is useful to ensure the bot isn't acting like Tay did...

(And more broadly, if you're trying to use an ML model to e.g. screen candidates for hiring, you better be damn sure it's not causing discriminatory hiring practices.)

lambdaphagy · on Jan 3, 2021

I think it's fine to build NLP models with any desired property you like, including "not leaking vulgar opinions that offend the courtly manners of your society." Makes plenty of sense.

But I do wish people would be more frank and self-aware about these purposes, rather than fig-leafing it as "ethics" or "fairness". By the time you've deliberately omitting the US Congressional Record on grounds of problematicity, it's worth asking yourself whether you're making your bot polite or doing damnatio memoriae.

6gvONxR4sf7o · on Jan 2, 2021

Bias in the statistical sense is usually E[\hat beta - beta]. By which I mean there’s a specific aspect of this thing I’m trying to get. The whole field of causal inference is based on the fact that if you do things naively, you might mix your signals. Like how linear regression can get you biased or unbiased coefficients in different settings. Sometimes you need something like IV because just plugging in your data will tell you that ambulances are bad because it indicates the patient will more likely die even after conditioning on everything else catalogued.

It’s not opinions I disagree with, it’s aspects and behavior I don’t want, which is the statistical sense.

lambdaphagy · on Jan 3, 2021

Bias in prediction, rather than parameter estimation, is a perfectly well established sense of the term. In particular, people doing language modeling are practically never concerned with identifiability, because you can't pick out one weight out of a trillion parameter model and say what it ought to be in the limit of infinite data.

6gvONxR4sf7o · on Jan 3, 2021

But when people use the term bias in NLP, that’s what they’re talking about. They don’t want an aspect of the model to do something it ought not do. It’s a case of omitted variable bias causing things like the word analogy issues you hear about. Not an issue of bias in predicting the masked word.

yorwba · on Jan 2, 2021

If you create a model that outputs text, and that output contains opinions you disagree with, people are still going to act as if you agree, because it was you who created it.

So often "we can't use Reddit because lots of stuff on Reddit is offensive to somebody" does make sense.

6gvONxR4sf7o · on Jan 2, 2021

> And for us in the AI community, it was something like the hand of God reaching down to bless us with a guardian angel.

I’m disagreeing as a member of the community. I think back to the original books corpus. It wasn’t Stephen King, but it was books from authors who gave away their content more permissively than King. It sucked how unreproducible it was, but that lack of reproducibility came from compliance with the authors wishes. If you compare it with the recent books3, the real difference is that the creator just ignored the authors.

> Ultimately, their actions are legal.

I worked on a project recently that dealt with the question of whether willingness to do a takedown was sufficient, and our legal team warned us that it isn’t. We threw away lots of work, which was a hard choice I’m proud of. The tl;dr was analogous to this: if your site just hosts dvd rips of all of the big studios’ movies, then Disney comes along and sends you a DMCA request which you comply with, you aren’t protected to hold everybody else’s movies just because they haven’t sent a DMCA request yet. That only protects you when it’s mostly stuff that uploader actually has the authority to upload in the first place. (It was much more legalese, of course, and I don’t understand it perfectly, but the gist is that DMCA compliance is necessary but not sufficient.) i.e. the Pirate Bay or old school Napster can’t just start honoring DMCA requests and be in the clear.

Putting the onus on creators to request a takedown is telling them to play whack-a-mole. As someone with a pipe dream of leaving ML and becoming a novelist, our community’s popular approach to IP embarrasses me. “Reproducibility is hard because the owners of the data don’t want us to pass it around” doesn’t mean to find a way to ignore the owners. Disregarding data owners’ wishes in return for better models is how the world has gotten into this privacy hell.

So many of us act as if, because copying something is essentially free, the thing should be. It’s embarrassing because our own output is in that form, so we of all people should recognize that the cost of copying it is the least important factor. Or we just call out fair use, as if that applies to everything under the sun. The eye does more, but for me their approach still makes me feel embarrassed that my own community applauds them.

dlkf · on Jan 3, 2021

It's always telling when you see a lot of down votes and no counterarguments.

6gvONxR4sf7o · on Jan 2, 2021

Edit too late to edit: Reading through the The Pile paper they define public data as such:

> Public data is data which is freely and readily available on the internet. This primarily excludes data which is pay-walled (regardless of how easy that paywall is to bypass) and data which cannot be easily obtained but can be obtained, e.g. through a torrent or on the dark web.

This should disqualify books3, but they use it to justify books3.

They need to extent that definition transitively. Maybe that’s my frustration. If I collect data from torrents and make it more freely and readily available, then it meets their definition of public data, which is basically what books3 is. If they included it themselves, it wouldn’t meet their definition but because someone else redistributed it first, it’s more okay for them to redistribute?

nixtaken · on Jan 11, 2021

They've also scraped HackerNews posts. Since I posted blog links to HackerNews, does that mean they stole all of my blog posts? That represents three years of work and the chapters of 3 books that I intend to publish. They just took it and will start delivering it to their users to help them write more interesting content? Not okay.

stellaathena · on Jan 3, 2021

If I Google “bibliotik download” the first page of results has a link to download. Calling that “not easy to obtain” is absurd.

And yes, your last sentence is correct. That’s exactly the position being taken.

6gvONxR4sf7o · on Jan 4, 2021

> If I Google “bibliotik download” the first page of results has a link to download. Calling that “not easy to obtain” is absurd.

You're right. Then why do you exclude so much else that meets the same criteria?

If I Google “[insert dataset here] download” the first page of results has a link to download. Calling that “not easy to obtain” is absurd.

In your definitions, you exclude vast swaths of things meeting that criteria in a seemingly arbitrary manner, which is why it's weird.

legatus · on Jan 1, 2021

I think it's worth noting that EleutherAI is a grassroots collection of researchers, which distinguishes it from academia/industry labs.

As part of their work on democratizing AI, they're now hoping to replicate GPT-3 and release it for free (unlike OpenAI's API).

I would encourage everyone interested to join their discord server (https://discord.gg/BK2v3EJ) -- they're extremely friendly and I think it's a project worth contributing to.

odnes · on Jan 1, 2021

How are they sourcing/funding the compute to train these massive models?

stellaathena · on Jan 2, 2021

TFRC, a program that lets you borrow Google's TPUs when they're not being used. You can apply here: https://www.tensorflow.org/tfrc

Alvion_Bleeds · on Jan 2, 2021

Connor Leahy, who I think is a sort of BDFL figure for ElutherAI, mentioned in a Slatestarcodex online meetup I attended that Google donated millions of dollars worth of preemptable TPU credits to the project. There is a video of the meetup on YouTube somewhere. Struck me as a really smart kid with a lot of passion.

chillee · on Jan 2, 2021

Haha Connor (although one of the main participants) definitely isn't a BDFL - we don't have any BDFLs :)

We don't really have much of a hierarchy at all - it's mostly just a collection of researchers of widely varying backgrounds all interested in ML research.

stellaathena · on Jan 2, 2021

I'm not sure what a BDFL figure is, but Google does not give us millions of dollars. We are a part of TFRC, a program where researchers and non-profits can borrow TPUs when they're not being used. You could say that we are indirectly funded as a result, but it's nowhere near millions of dollars and it doesn't reflect any kind of special relationship with Google.

o-__-o · on Jan 2, 2021

Benevolent dictator for life

leogao · on Jan 2, 2021

EleutherAI has a very flat hierarchy; we do not have any BDFL-like figure.

luto · on Jan 1, 2021

they'll probably run it on scientific clusters of various universities, or on collections of idle lab desktop machines. Both of these tend to sit idle a lot of the time, based on my experience at uni in Europe.

joe_the_user · on Jan 1, 2021

Any idea how large dataset used to train GPT-3 was?

arugulum · on Jan 1, 2021

570GB of Common Crawl post-filtering, but only 40% of CC data was seen even once during training, though CC is only 60% of the training data. You could work through the math to find the rough size of GPT-3's training data, but it sounds like The Pile is of comparable size.

leogao · on Jan 2, 2021

Yeah, the Pile is approximately the size of the GPT-3 training data, which is not a coincidence--one major reason we created the Pile (though certainly not the only one) was for our GPT-3 replication project.

forgingahead · on Jan 2, 2021

This is a great effort, and it's important to have datasets like this available to democratise ML learning and work.

One small comment: It would be great for this (and other) datasets to give a quick "sample data" file - preferably one that doesn't need to be downloaded to be viewed. Even a screenshot of some of the data would be useful for people browsing to get a quick understanding of the actual content, and how it is formatted. Downloading gigabytes of data just to have a look isn't practical.

remexre · on Jan 2, 2021

The test and validation datasets are under half a gig; I agree it'd be nice if there were a few example items and perhaps a JSON Schema [0] for it, though.

I'll reproduce a line here:

    {"text": "Roman Catholic Diocese of Tambacounda\n\nThe Roman Catholic Diocese of Tambacounda () is a diocese located in the city of Tambacounda in the Ecclesiastical province of Dakar in Senegal.\n\nHistory\n August 13, 1970: Established as Apostolic Prefecture of Tambacounda from the Diocese of Kaolack and Diocese of Saint-Louis du S\u00e9n\u00e9gal\n April 17, 1989: Promoted as Diocese of Tambacounda\n\nSpecial churches\n The cathedral is Cath\u00e9drale Marie Reine de l\u2019Univers in Tambacounda, which is located in the Medina Coura neighborhood of the town.\n\nLeadership\n Bishops of Tambacounda (Roman rite)\n Bishop Jean-No\u00ebl Diouf (since 1989.04.17)\n Prefects Apostolic of Tambacounda (Roman rite) \n Fr. Cl\u00e9ment Cailleau, C.S.Sp. (1970.08.13 \u2013 1986.04.24)\n\nSee also\nRoman Catholicism in Senegal\n\nReferences\n\nExternal links\n GCatholic.org\n Catholic Hierarchy \n\nCategory:Roman Catholic dioceses in Senegal\nCategory:Tambacounda\nCategory:Christian organizations established in 1970\nCategory:Roman Catholic dioceses and prelatures established in the 20th century", "meta": {"pile_set_name": "Wikipedia (en)"}}

[0]: https://json-schema.org/

gwern · on Jan 2, 2021

I believe the paper includes random samples at the end.

mrconter1 · on Jan 3, 2021

It includes two example snippets from each dataset.

w1nk · on Jan 2, 2021

Will this dataset be able to produce a model that's actually better than GPT-3? One of the things mentioned in the paper is that the dataset had some buggy filtering applied to it (https://arxiv.org/pdf/2005.14165.pdf, page 9), that minimally impacted benchmarking some, there's a whole section on how they tried to deal with it. The gist though is that a cleaner training run, even on slightly more data, may (should?) actually produce something a bit better.

Does anyone know if OpenAI has retrained/updated gpt-3 yet?

thereticent · on Jan 1, 2021

Love the name. It's what we called our shared FTP-served collection of mp3s during college 15-19 years ago. Yours is a MUCH more impressive amount of information!

leogao · on Jan 1, 2021

Twitter thread: https://twitter.com/nabla_theta/status/1345130412532645888

eutectic · on Jan 1, 2021

I'm kind of surprised there's 800GB of text in the world.

pbourke · on Jan 1, 2021

It’s not that much, actually. The average English word is 4.7 letters long. Let’s round up to 5 and add 1 for a space to make 6 characters. Novels are around 90,000 words long. So 800G of pure text represents 800G / 6 char per word / 90,000 words per book =~ 1.5M books. The Library of Congress has 39M books, not to mention all the text produced that’s exclusively online.

d33 · on Jan 2, 2021

That makes me wonder what's the upper bound of storage required to contain all of the text humans had ever written. 1k times as much? 10k? Either way, I have a feeling that all of the drives would probably fit in a regular apartment room.

Crazy how we went from first computers spanning across entire floors to fitting all of human thoughts in such tight space.

Aerroon · on Jan 2, 2021

The entire Library of Congress according to the earlier calculations would fit into 21 TB. There's already a 100 TB SSD that's the size of a 3.5" HDD.[0] 10000 times the Library of Congress would fit onto 2100 of these drives.

The dimensions for this 3.5" drive are 26.1 x 147 x 101.8 mm.[1] That's a volume of roughly 391 cm^3. 2100 of them take up 821,100 cm^3.

A Fractal Design Define 7 has the dimensions of 547 x 240 x 475 mm. That's a volume of 62,358 cm^3.[2]

You could fit 2100 of those SSDs into the same volume as about 13.2 Fractal Design Define 7 cases. That's roughly the size of a bookshelf. You could fit the books of 10000 Libraries of Congress onto one bookshelf.

Now imagine using tape for this. Edit: I'm actually not sure whether tape has better density.

[0] https://www.techradar.com/news/at-100tb-the-worlds-biggest-s...

[1] https://www.newegg.com/nimbus-data-dc-100tb/p/2U3-002M-00004

[2] https://www.tomshardware.com/reviews/fractal-design-define-7

stellaathena · on Jan 2, 2021

There's more than 800GB of fanfiction in the world.

eutectic · on Jan 2, 2021

Well, that's truly scary!

barkingcat · on Jan 2, 2021

Not scary, titillating.

tens of G's for each desire.

leogao · on Jan 2, 2021

Oh, there's a lot more... no promises atm, but we plan on going much bigger for our next dataset.

jan_Inkepa · on Jan 2, 2021

Cool project! Minor note: It would be nice if ye said what language the dataset was for on the website :)

mrconter1 · on Jan 3, 2021

Would you mind sharing some hints? What new approaches are you taking compared to this one?

lovelearning · on Jan 2, 2021

It's not that big even as a text dataset. Common Crawl weighs in at some 250+TB. Even if we assume just 1% of that web data is usable text (it's likely much more), it's still 2.5+TB.

https://en.wikipedia.org/wiki/Common_Crawl

jcims · on Jan 2, 2021

I feel like YouTube is going to be a major source of language data in the future. The last statistic I saw was 500 hours of video uploaded every minute. If only 10% of those videos have original speaking in them and those average 40 words a minute, that’s almost 300 GB of transcribed speech per year.

wongarsu · on Jan 2, 2021

YouTube would be a great source for spoken language. But only a tiny portion of YouTube has subtitles, and it doesn't yet feel like automatic transcription is at a level where you would want to use its output to train something else. That day will surely come though

luto · on Jan 1, 2021

heh, yeah! for comparison: the English Wikipedia is around 40 GB of text.

https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#Si...

lallysingh · on Jan 2, 2021

I wonder when this will be a benchmark number for a desktop GPU/NPU...

bratao · on Jan 1, 2021

I´m super excited by this dataset. The EleuterAI team is stellar and many great things are coming soon from them!

pizza · on Jan 1, 2021

Related: is it possible for me to download/try out a pretrained GPT-Neo?

stellaathena · on Jan 1, 2021

Not currently. The code is on GitHub, but we do not have a public-facing model. We felt that the world doesn't need another GPT-2 replica, but if there's interest we can look into doing so though. We are planning on making our GPT-3 replica public facing though.

https://github.com/EleutherAI/gpt-neo

floatingatoll · on Jan 2, 2021

To clarify, they mean diverse as in “unconnected datasets”.

durnygbur · on Jan 2, 2021

Someone explain me how this will not stir the wrath of copyright echelons, especially American ones? Or once I get PhD I can torrent movies again?

unixhero · on Jan 2, 2021

How can I start working on this dataset? Are there any Python GPT libraries?

mrfusion · on Jan 3, 2021

Why not throw Wikipedia into this collection?

ValueNull · on Jan 3, 2021

Hopefully not the Scots version

stellaathena · on Jan 3, 2021

... is it in the collection?