OpenAI releases larger GPT-2 model (openai.com)
234 points by p1esk on May 4, 2019 | 82 comments


Notably, the 345M model (1.5 GB on disk) is big enough that it's pushing the limits of conventional GPUs, and an alternative method of finetuning the model (https://github.com/nshepperd/gpt-2/commit/47df6da611716b4826...) has to be used to prevent the GPU from going OOM.

I'm working on tools to streamline GPT-2 text generation: I'm currently porting the code above to gpt-2-simple (https://github.com/minimaxir/gpt-2-simple) to allow easy finetuning/generation, and am also working on a way to quickly build an API/client for deploying GPT-2 to production and generating text at scale, cost-effectively. Even with the 117M model, managing CPU and RAM performance is tricky.

But given the incredible results of just the 117M model (e.g. Hacker News titles from a retrained 117M model: https://github.com/minimaxir/hacker-news-gpt-2), I'm eager to put the 345M model through its paces.
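For anyone who wants to try this themselves, the basic gpt-2-simple workflow looks roughly like the following (a minimal sketch based on the package's README; "hn_titles.txt" is a hypothetical newline-delimited corpus and exact arguments may differ between versions):

    import gpt_2_simple as gpt2

    gpt2.download_gpt2(model_name="117M")   # fetch the pretrained weights

    sess = gpt2.start_tf_sess()
    # finetune on your own corpus; steps is just an illustrative value
    gpt2.finetune(sess, "hn_titles.txt", model_name="117M", steps=1000)

    # sample some text from the finetuned model
    gpt2.generate(sess, length=40, temperature=0.7, nsamples=5)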


>(e.g. Hacker News titles from a retrained 117M model: https://github.com/minimaxir/hacker-news-gpt-2)

Wow, that's great. “The Bullshit Bubble” “Fuck you, Bootstrap” “We should give up on America” - they’re practically comedy, yet very believable too.


"How to Make a T-Shirt in 10,000 hours" is my favorite.


I think it's just edged by "Why I'm ditching ebike hoses for barometers" for me.

Though "Hacker makes an infinitely scaleable cup of coffee" gives it a good run for its money.


I had to do a search to make sure that (or something analogous) wasn't actually real! It seems incredibly on target.


This is great. A possible hackathon project for writers: everyone gets a title and has to write the post.


My favourite is "The secret Nazi medical charity giving out freezers".

Very reminiscent of headlines from The Day Today: https://youtu.be/wdEcO8_2Kl8


Has anybody tried feeding it comedy to begin with to see what it spits back out?


There was an attempt on Reddit but it didn't turn out well, likely because there wasn't enough input data: https://www.reddit.com/r/MachineLearning/comments/bgvzdu/d_j...

(I have an idea for a more proper approach)


I tried it on The Office quotes [1] and it seems to generate some more or less humorous examples.

[1] https://www.reddit.com/r/MachineLearning/comments/bmn0og/p_l...


I can provide more details next week once the article is out, but I've fed it a corpus of one-liners and it started to produce weird and sometimes very funny ones :)


It'd probably produce a crap sitcom. But then again so does the BBC these days.


Update: 345M model support successfully added to gpt-2-simple and new release pushed to PyPI, plus I've updated the corresponding Colaboratory Notebook accordingly (https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRN...)

At a glance it's hard to tell if the 345M model is "better" at the moment (it's all qualitative), but I'll be doing more testing. Unfortunately, the 345M model might be slightly too resource-intensive for the API/client use case I had in mind, so I'll likely be sticking with the 117M model for now.


One example from your dataset that gave me a laugh:

Hiring technical debt (or "unsortable overtime")

How do you hire technical debt??

Edit: another one-

“In 2009, Africa power creation was switched on for the Google Earth Darth Vader Imperial Warplane Propaganda”

I guess Google is diversifying :P


Outsource a project to a contracting agency and you’ve hired technical debt.


I guess it went down this path:

-Hiring -> Software Developer

-Software Developer -> Code

-Code -> Technical Debt

You only have to hire a Software Developer :)


It has found a shortcut.


> Show HN: Blogia, a blogging framework for Ruby

Wow, this sounded so believable I had to check if this exists or not. It seems it doesn't exist, yet :).


This thing really does scare me and fascinate me at the same time.

There are so many nefarious things you could do with this...

Generate fake news, spam Google, etc.

One valuable use could be to generate comedy and parody.

You could also use it to sabotage others: set it loose on Nazi forums and have them argue with bots constantly.


I think people are over-worried about fake material. We've been able to tell lies since we became human. We've been able to print lies since printing. We can already edit photos. People routinely make fake screenshots of messages to defame others. Newspapers sometimes print fake stories based on lies given by their sources or which they fail to vet. But provenance is still what proves things, not how convincing something looks.


The problem is going to be the volume, and how long it takes the reader to figure out what is going on. You can look at a photo and quickly realize something is wrong. With this, you may need to read a few paragraphs or even start reading through what else the "author" has written. That is effectively a denial-of-service attack on human comprehension.

Consider the issue of fake reviews: sometimes fake reviews are really obvious. Sometimes they aren't. Often the best way to pick out the fakes is to analyze all of the other reviews the user has written. That is going to become harder.

For very technical topics, where the reader comes with a strong background knowledge in that topic, picking out the fake material isn't too difficult. I suspect for the hazier things where the writers are more or less stating opinions, like politics, it is going to be incredibly difficult (for a human reader) to separate the bots from real people.


If evidence alone was enough, there wouldn’t be any debate about antivaxxers, climate change, the Iraq WMDs, the 9/11 and moon landing conspiracies, Holocaust denial, the correct way to bring peace to the Israel/Palestine conflict, the safety of being white in a Muslim community or being black around an American cop, Brexit, any of the Assange drama, or if Snowden was acting appropriately by leaking what he leaked. And those are just the disagreements I’ve witnessed.

Just because lies and propaganda have always been around, it doesn’t mean they were never a problem.


To take an optimistic view, it may eventually teach people to more critically evaluate anything they read.


Previously, if you wanted to run a propaganda department, you needed to be a government or a large news organization. Now it will be within the reach of 4chan.


True. Authors at The Onion should be afraid.


I hadn't heard of gradient checkpointing before, thank you for the link! Do you know how it compares to gradient accumulation? The latter basically reduces the batch size but takes the sum of multiple gradients before actually performing an update, thereby having the same effect as the original batch size.

The generated titles are great! You can put them into hncynic (https://github.com/leod/hncynic) to get closer to a fully generated HN experience.


Gradient accumulation and gradient checkpointing are orthogonal. You might want to use them simultaneously.

If I had to compare them, I'd say that accumulation is about working on a minibatch datapoint by datapoint and faking being able to run an entire large minibatch in a single shot, while checkpointing is about working on a model layer by layer and faking being able to run an entire model in a single shot.
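(A minimal sketch of accumulation, in PyTorch-style Python rather than the TensorFlow of nshepperd's fork, assuming model, optimizer, loss_fn, and loader already exist:)

    accum_steps = 8                    # effective batch = accum_steps * micro-batch size

    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y) / accum_steps   # scale so summed grads match one big batch
        loss.backward()                             # gradients add up in .grad across iterations
        if (i + 1) % accum_steps == 0:
            optimizer.step()                        # one update per accum_steps micro-batches
            optimizer.zero_grad()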

The problem with GPT-2-345M and why nshepperd had to mess with gradient checkpointing is that the GPT-2-345M model will literally not fit in your standard 11GB GPU (and from Twitter comments about people trying it on the new 16GB Google Colab instances, it's unclear if 16GB would be enough either!). You can't even run minibatch n=1. It doesn't fit. It OOMs.

The model itself is only a gigabyte or so; the problem is that the self-attention layers, when run, use up a huge amount of memory for their intermediate steps, which must be stored in order to trace everything backwards through each step for the backprop part of training.

(Right now I believe nshepperd's code punts on doing gradient accumulation simultaneous with gradient checkpointing, so we've just been reducing the learning rate, which is sort of similar to faking large minibatches with gradient accumulation.)

Fortunately, because the self-attention layers are so small and cheap to compute, they work well with gradient checkpointing. They're cheap to recompute on the fly, so it's more important to save memory and allow training at all. (This is also how OpenAI is training the Sparse Transformers which are enormous; they haven't said either way, but I assume this is how they trained the larger GPT-2s like the 1.5b parameter version, because I can't imagine what hardware would fit even a single GPT-2 1.5b without tricks.)
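(If it helps to see the checkpointing idea concretely, here's a sketch in PyTorch using torch.utils.checkpoint, assuming `blocks` is a list of transformer-block nn.Modules; the linked fork does the equivalent in TensorFlow:)

    import torch
    from torch.utils.checkpoint import checkpoint

    class CheckpointedStack(torch.nn.Module):
        def __init__(self, blocks):
            super().__init__()
            self.blocks = torch.nn.ModuleList(blocks)

        def forward(self, h):
            for block in self.blocks:
                # Intermediate activations inside `block` are discarded after the
                # forward pass and recomputed during backward: more compute, less memory.
                h = checkpoint(block, h)
            return h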


Thank you so much for your comprehensive answer, this helps a lot.

If I understand nshepperd's code correctly, it uses a constant and small learning rate. Do you know if this works better than the learning rate schedule that is usually used for Transformer models (https://www.tensorflow.org/alpha/tutorials/text/transformer_...)?
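(For reference, the schedule in that tutorial is, if I recall correctly, the warmup-then-inverse-sqrt-decay rule from the original Transformer paper; a minimal sketch, with illustrative values for d_model and warmup_steps:)

    def transformer_lr(step, d_model=768, warmup_steps=4000):
        # Linear warmup for warmup_steps, then decay proportional to 1/sqrt(step).
        step = max(step, 1)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)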


It's a constant, yes. We haven't tried any other learning rate schedules (for my poetry GPT-2s, I simply drop the LR 10x each day or so). I have no idea if this is optimal for transfer learning or not.


Wow! I just had a blast putting titles into that. The results are amazing. Kudos!


Can you elaborate on the expected hardware requirements? (Both for fine tuning and inference)


This is really cool.

Just to pick your brain (I’m an ML noob and it triggered an idea in my head): I was wondering if GPT-2 could be used to generate small paraphrases of an input sentence for search suggestions? e.g. “I’m going to the mall today” -> “I will be going to the mall today”.


It seems to train fine on Colab without that flag, so that's a decent fallback for anyone running into memory issues on a local machine.


If you do a fresh pull of that fork from a Colab notebook, it will set the flag automatically if you specify the 345M model (I will follow that behavior in my own package).

If it didn't set the flag, it gets messy: https://twitter.com/minimaxir/status/1124742105421631488


Worth noting that the new 345M model is still far from the full 1.5B model they have been declining to release. The headline makes it seem like they finally decided to release the full model, but it's just a slightly larger demo model.


It strikes me as weird that they are not publishing it, by the way. According to their rhetoric when they started, that was the whole purpose of OpenAI: acknowledging that we are at a point where anybody with enough resources can produce something, let's say, interesting with ML, and striving to give everyone more or less equal possibilities by serving as a more effective academic organization for the world, before Facebook or Google takes over the world completely.

Plus, it's not as if this thing is more "potentially harmful" than, well... basically anything of use, like electricity, the internet, fire, or less perfect language models. In fact, it isn't even anything new; it's just a (possibly) less broken language model than what we already have.

Admittedly, it would be quite problematic to use the full model with today's mainstream GPUs, so I'm not that saddened by them hoarding it. It just seems curious to me.


While I appreciate that this is a large advance, I’m worried that these releases will make the internet completely worthless after some time. If AI can come up with fake news, fake text, fake videos and pretty much anything the user wants it to, then we will be flooded with biased content that’s untrustworthy. There’s probably some critical percentage of AI-generated content on the web that guarantees this happening. (I’m guessing it’s around 40%.)


Their release strategy is to provide lower quality models to the public while giving research partners access to the full models. The goal of this approach is to let researchers devise methods of detecting and counteracting this new technology. It’s kind of like “this technology is going to exist so we need to prepare responsibly.”


I've seen a lot of posts along these lines, but I'm unclear as to what specific scenarios this technology precipitates. Like what, concretely, is the concern? There's already a lot of bad content online, and anyone who cares about information quality already relies on filtering through human editors. I can buy the idea in principle that adding orders of magnitude more noise to the system might fundamentally destabilize it... again. But it's really not clear to me.


Oh wow, I was literally just making a toy Discord bot for GPT-2. Guess I'll update it with the bigger model.

EDIT: Done! It takes about 3x longer than it did before to generate a response, so if you try it be very patient. Also, I made this in three hours so I wouldn't be surprised if it goes up in flames at some point.

https://github.com/ScottPeterJohnson/gpt2-discord


How long does it take, and which hardware do you use?


About 30 seconds for a smaller response on an EC2 instance. Mind you this is without a GPU because I couldn't figure out how to set one up.


I tried adding it but I get the error message "This bot is private. Only the bot owner can add it."


Oh, sorry. I forgot to mark it as public. Feel free to try again.


Yeah, it works now!

Tried with this input: "what is your purpose as a bot?" Response: "There are a handful of examples available out there that seem fairly plausible as a good initial goal. These examples might get you started: A bot with a purpose to collect information (and I'm going to assume that you have to have an intent, so I'll be using that term and not the verb for my example). Or a bot with a purpose to help out a customer that wants something delivered to him/her by mistake. Or a bot that wants to promote your site to the entire universe of bots. For example, here in New Zealand, that's a fairly common use of botnets as far as I can see: How do you know I have enough information to set it up? Let's say I have an account with the website at /r/frugal, what if you are already running bots that want what I need? How do you know these bots would actually be helpful? One answer I see is"

Pretty cool stuff, but I see your description of it as "semi-coherent gibberish" fits really well. I could see something like this getting pretty interesting in the future, where you could, for instance, populate a Discord server or Twitter chat with semi-plausible noise to fake popularity.


Maybe a dumb question - how does the model which is trained to predict the next words answer questions, as shown in the reading comprehension example? Do you just feed it the question and watch it generate the answer, or is something else going on?


If I remember correctly, they say that since the training set contains extracts of question-answer sessions, it will detect the pattern and follow it when you give an appropriate prompt. So yes, you just feed the question and, detecting that it is a question, it answers.


You add a linear classifier on top to predict the start and end positions of the answer span. The augmented model is trained on a QA dataset like SQuAD to actually learn how to answer questions.

Hugging Face has a simple implementation that augments BERT in this manner, and you can see the code there. Their BertQA model gets about an 84 F1 on SQuAD 1.1, which is really strong performance. You can augment their GPT-2 implementation similarly.
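(A minimal sketch of that span-prediction setup using today's Hugging Face transformers API (in 2019 the same thing lived in pytorch-pretrained-bert), with an illustrative SQuAD-finetuned checkpoint name:)

    import torch
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    name = "bert-large-uncased-whole-word-masking-finetuned-squad"  # illustrative checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForQuestionAnswering.from_pretrained(name)

    question = "Who released the larger GPT-2 model?"
    context = "OpenAI released the 345M parameter GPT-2 model in May 2019."
    inputs = tokenizer(question, context, return_tensors="pt")

    with torch.no_grad():
        out = model(**inputs)            # start_logits / end_logits over the tokens
    start = out.start_logits.argmax()
    end = out.end_logits.argmax()
    print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))  # predicted answer span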


I think they used GPT-2 for QA without any finetuning.


People should be as astonished by what this model can create as by the vacuous, insipid garbage that passes for writing these days. Some of the worst offenders are corporate CEOs spewing bullshit while people sift through the verbal excrement, trying to figure out what they had for breakfast at 4:00 in the hope it will make them rich too.


If you guys want some background check out this blog post:

https://openai.com/blog/better-language-models/

It's fascinating to think an AI wrote that story...


That's actually the same page as the main link (which is scrolled down to the update).


> Due to our concerns about malicious applications of the technology, we are not releasing the trained model.

Has anyone said when they intend to release the full model? There are likely a lot of positive applications of this technology as well.


They said within six months; however, by that time there will likely be something better.


Publicly releasing AI tech that is six months behind the state of the art might be a reasonable compromise against the fear of nefarious use. That way, more people know that the tech exists before it's available to everyone, and get a six-month window to adapt.

At the same time, the risk is lower that a different team develops similar capabilities and widely deploys it without anyone even knowing that what they do is possible.

I wouldn't be surprised if this pattern is followed even more in the future. It's an acknowledgement that there are many angry cats that will (almost unstoppably) be let out of the bag in the next decades, but an attempt to release the cats in the most responsible manner possible.


> They said within 6 months, however by that time there will likely be something better.

There has been: BERT has come out and is better.


Erm, isn't a big part of this whole story that GPT-2 > BERT > GPT?


BERT is bidirectional. How do you use that for language generation?


It will generate words for every [MASK] in its input sequence.
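(A minimal sketch of that masked-token mechanic with the Hugging Face transformers API; the model name and example sentence are just illustrations:)

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")

    inputs = tokenizer("OpenAI released a larger [MASK] model.", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    predicted_id = logits[0, mask_pos].argmax(-1)
    print(tokenizer.decode(predicted_id))   # the single token BERT fills in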


It'll generate one token, because it's trained to predict one missing [MASK], as I understood it. What is the scaffolding? Do you generate random sentences and iterate repeatedly? And how does that get you whole coherent paragraphs? (Has anyone demonstrated that this actually works with BERT?)


BERT can pretty easily be used to generate text. It's intended to be used as a base model and fine-tuned with an additional model on top. The fine-tuning model could then be trained to generate sentences with the underlying language model powered by BERT.


Tangentially, is there any estimate of the number of words/sentences that a kid may have heard by the time he or she learns to talk?


Maybe I’m too naive here, but I’m not seeing the potential malicious usage of this model. People will generate text, and then what?


Fiction, but this might be a good place to start: https://slatestarcodex.com/2018/10/30/sort-by-controversial/


Especially what kind of usage could not already be achieved by asking a human to write the text.


It makes it cheaper, faster, and effectively infinitely scalable. You'll run out of man-hours and people to generate this stuff by hand long before you run out of resources to launch in the cloud.

With the main concerns being troll army/fake news type stuff, I don't think this makes a difference. We seem pretty sure there are state level actors behind a lot of that stuff, and I think it would be silly to believe they can't recreate something at the level of GPT-2, especially with the underlying principles out there and understood, competitors like BERT available, etc.

I think their heart is in the right place, but also incredibly naive.


It will accelerate the development of a fact checking AI, which means lying and manipulating people won't be so profitable any more, I guess.


That ain’t it.


In other words: "we can't release our findings before retraining on a filtered Reddit dataset, because we haven't released our first finding for a reason... stay tuned for more trained models with filtered content." (What's the fuss about the first fully trained model?)


Here’s a summary: we are aware of the fact that this model will harm society and we are releasing it anyway. We are fiddling around with the way it’s released in an attempt to absolve ourselves of blame while simultaneously collecting the profit in the form of a juicy acquisition.

The net result of these advanced forms of signal processing will be negative. Nobody has come forward to prove that they will benefit society on the whole or even that they are safe. But anyone who raises concern is shouted down and called names like “alarmist” and “Luddite.”

These companies are playing with fire, and the whole world stands to be burned. Wake the fuck up.


Someone is going to invent this model sooner or later, simply because it is possible. There is not much sense in trying to stop it. We just have to adapt.


That’s not correct. What you are saying is that there is no plausible organized effort that could stop or slow the creation of signal-processing models that will have pronounced negative impacts. The error is on two levels: you are relying too much on analogy with other technologies, and you are writing off the possibility of stopping AI when it’s still not clear that it can’t be stopped.

This isn’t something that can be built and tested in isolation like other things we are familiar with. Training these models is not an exact science. Nothing about ai is an exact science. Progress only comes with trial and error. And each trial requires huge compute resources; at least for the most capable and dangerous models. It can’t be done in your basement. Not without significant effort and drawing attention to yourself. Could we sense whenever someone was trying to do it? Could we form a global coalition to stop every attempt? That brings us to the next thing.

What you are doing is the following: we are both in a car that is about to roll off a cliff. I propose that we try pressing the brakes. You respond by saying that, geez it looks like we probably wouldn’t stop in time — we are going awfully fast and it probably wouldn’t work to press the brakes so why even try? Let’s just brace our heads and hope the impact doesn’t kill us.

Obviously the better thing to do is to try and press the brakes. Even if you aren’t sure if you can stop in time.


I think if you try to be less cynical about what OpenAI is doing you might even find them an ally to your perspective. Facebook throwing pocket change at an AI ethics org and Google's rather embarrassing failure at staffing an ethics panel of its own is evidence that we've got bicycle brakes on a freight train.

OpenAI suffered a ton of blowback for not just releasing the full model from the start. You can read their initial blog post [1], looking particularly at the sections on Policy Implications and Release Strategy. I would also highly recommend you listen to Lex Fridman's podcast with Greg Brockman [2] to hear their rationale for the recent org changes at OpenAI.

Obviously you can posit that everything they say is bullshit and they are only after almighty dollars. I can't prove it's not true and personally believe there is at least a kernel of truth to it, but we live in a messy world and finding imperfect allies is generally better than having none at all.

[1] - https://openai.com/blog/better-language-models/

[2] - https://www.youtube.com/watch?v=bIrEM2FbOLU


What do you mean "Could we sense when someone was trying to do it", are we talking mandatory computer inspections, government-enforced walled gardens, and the death of the general purpose computer? lol, if you thought gun control was hard...

Also, yeah, you could do it in your basement without being detected.

You can do it in the cloud, too, a lot of us here have the skills and resources to do it, but why spend time on this toy instead of another one, especially if it's going to cost money we could spend on something more fun?


> The error is on two levels: you are using too much analogy with other technologies. And you are writing off the possibility of stopping ai when it’s still not clear that it can’t be stopped.

No. What I am saying is that it's impossible to centrally control the actions of 7.7 billion free humans. A lot of them will disagree with your position (and any other position as well).

By trying to "put a stop to it" in a central manner, you are only making it harder (but not impossible) for some subset to learn about this phenomenon, to improve on it and to understand its strengths and weaknesses.

I am unconvinced by your argument that we could reliably detect, much less stop, attempts to train a large and useful machine learning model.


Correction: You are in a car that doesn't have any brakes.


I don't understand why it's downvoted (at the moment of writing, at least). I think it's absolutely correct.

Imagine a thing "we should not do": the creation of a really powerful language model, genetic engineering to produce smarter, stronger children, or whatever. Whatever you are opposed to, really. The simple fact is that if you (and by "you" I mean any entity you associate yourself with, be it literally you, or your company, or a group of researchers in your country, or the government of your country) can do something, anybody can, even if it's a couple of years later. You are not unique, nor alone. You can stop "yourselves" as long as you want, but there are other people, companies, research groups, and governments, and they don't give a fuck about what you think "we" should do.

So, in the end, the car really doesn't have any brakes. Even if it truly means the end of it all, it's just unfortunate, but really, really unavoidable.


Nobody has proven these technologies will be a net negative either; there are many positive applications.

>These companies are playing with fire, and the whole world stands to be burned. Wake the fuck up.

Your tone and arrogance are very unwelcome.


> of a juicy acquisition

OpenAI is a non-profit. They are not looking to get acquired.


Is it still a non-profit? I think they changed their structure to some fudged thing recently.


Yes, but the idea of their for-profit subsidiary is to develop new stuff, not sell old stuff. (It'd be like the Hershey Foundation selling Hershey Chocolate.) No one is making >100x returns off of spinning off a 'GPT-2 startup', either. GPT-2 isn't that far in advance of everything else; there's already a replication of the WebText corpus they trained GPT-2 on.


It's the whole premise of OpenAI - they democratize AI, give everyone access. Since it's known to be possible, someone else would be able to repeat it anyway.

Saying hyperbolic things like "the whole world stands to be burned" is silly. They aren't giving everyone a nuke.

Things like deep fakes and synthesized speech are much much worse, since you can make any politician say anything you want, and an average person wouldn't be able to tell it's fake.



