Hacker News
OpenAI is Using Reddit to Teach An Artificial Intelligence How to Speak (futurism.com)
260 points by niccolop on Oct 11, 2016 | 192 comments



Back in 2007, mobile phones used a predictive-text system called T9, from Nuance Corp, which was trained on a word corpus taken from IRC and similar chats. This caused all kinds of issues - the phones would accept offensive words like "naziparking" but reject normal language like "world peace". Using Reddit may lead to... surprises.

Source: http://spraktidningen.se/artiklar/2007/11/darfor-ar-din-mobi...

Translated by Google: https://translate.google.se/translate?sl=sv&tl=en&js=y&prev=...


T9 would autocomplete my name to "Asian Lung," which was hilarious to my high school friends. And to be fair, I am Asian and I do have lungs.


To be fair, most people are Asian and have lungs.


At first, I read your comment and wondered if "most" applies to situations in which a plurality but not a majority was being talked about.

Then I remembered this [0].

Still wonder about the plurality/majority thing, though.

[0] http://imgur.com/CK6aONG


As with most things in language, it appears the interpretation of "most" is dependent on the context.

https://english.stackexchange.com/questions/55920/is-most-eq...


Well look at the big lungs on Brad


I had gotten the impression that happened because they were using an algorithmic compression engine that favored false positives over false negatives. In other words, they had essentially trained an algorithm to generate "strings of characters looking like words" as a means of compression, and some weird strings of characters were determined by the algorithm to be words even though they shouldn't have been.


I'm having a hard time imagining being sensitive enough that "naziparking" is offensive. Is any mention of the concept of nazis verboten?


No, "grammar nazi" is considered inoffensive. Many people pernickety about grammar use that term to refer to themselves.


Might have been an exclusively German language problem, or maybe they cleaned it up since that article was published. I have a phone with T9 and it works remarkably well. Heck, I would prefer it to Swype. (Of course, it's not useful on smartphones, because they no longer have real keys.)


T9 was AMAZING and I miss it every day. Faster and more accurate than Swype for sure, especially since inputs for T9 are deterministic.

Very easy to text even from your pocket, etc. - something you can't do on a smartphone. Notice that texting accidents became more common with the widespread adoption of smartphones ;)


Note that you're comparing a touchscreen to tactile keys in that statement too, though. I could text from my pocket without T9, but try doing that with a touchscreen WITH T9...


And while we're doing that touch vs. tactile comparison, what I also loved about the old "dumb phones" was deterministic timing. I could do stuff on those phones from start to finish without looking at the screen - I quickly learned and remembered that e.g. this operation completed instantly, that menu took ~1/4 s to load, etc.

I think this feature was actually key to being able to memorize how to perform operations quickly. There's no way you can do that with a modern smartphone, where every other interaction lags for anything between 10 and 1000 ms, and inputs are sometimes dropped at random. It's this non-determinism of smartphone UIs that makes me look at the screen all the time when using them.


I think this is actually because of the changing interface. It doesn't matter how long the interface takes to be ready if the keystrokes are buffered (which they were), so you could press "menu down down down ok" and the thing would work, even if the menu took two seconds to show up.

You can't do that when you have to actually wait for the thing you want to select to show, otherwise you'll be selecting something completely different.


I am really hoping that we reach peak touchscreen soon, and companies can start exploring other form factors again. I want a fast CPU and wifi in something that isn't just a giant flat slab of glass.


Haven't we already?


Texting also became more common with the widespread adoption of smartphones - we'd of course see more accidents as a result, but the medium became massively more useful to a lot more people.


Yes, give me some bloody buttons on my phone.

I guess I'll need to wait until shape changing tactile surfaces come to phone screens, which might be a while / never.


Side note, "naziparking" is a word people say....?!?


Welcome to IRC.

Anyway, I could see someone using that term in frustration about a car having been parked across two spaces.


So maybe the "naziparkers" are those who go around keying the cars that are parked wrong.


I assumed it was a reference to domain parking/squatting.


I guess I got the alternate impression because I read the original Swedish article text...


That's weird. A grammar nazi is a stickler for grammar. You'd think naziparking would be a stickler for proper parking.


We'd been using T9 since at least the beginning of the millennium. It was one of the major selling points of Nokia phones, back when Snake was the other one.



T9 on phones goes all the way back to ~2000


The Reddit comment corpus is an awesome dataset. There's relatively little mark-up to scrub out, low duplication, good metadata, and a variety of topics.

We used it to train a syntax-enriched word2vec model. Write up and demo: https://explosion.ai/blog/sense2vec-with-spacy

Btw, the above was run on CPU in a couple of days, because spaCy doesn't use GPUs yet. I've applied for a grant from NVidia so I can fix that. If anyone from NVidia is reading, email me? :)
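The preprocessing step is the interesting part of sense2vec: tags are folded into the tokens before word2vec ever sees them, so different senses of a word get separate vectors. A rough stdlib sketch of the idea (the function name and input format here are illustrative, not spaCy's actual API):

```python
def merge_senses(tagged_tokens):
    """Fold POS/entity tags into the tokens themselves, so that e.g.
    duck|NOUN and duck|VERB become distinct vocabulary items for the
    downstream word2vec step. Multi-word entities are assumed to be
    pre-merged by the tagger; spaces become underscores, yielding
    keys like Carrot_Top|PERSON."""
    out = []
    for text, pos, entity in tagged_tokens:
        tag = entity if entity else pos
        out.append(text.replace(" ", "_") + "|" + tag)
    return out

print(merge_senses([("Carrot Top", "PROPN", "PERSON"),
                    ("is", "VERB", None),
                    ("funny", "ADJ", None)]))
# → ['Carrot_Top|PERSON', 'is|VERB', 'funny|ADJ']
```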


  > I've applied for a grant from NVidia so I can fix that.
A g2.xlarge is 65c/hour on AWS, FWIW.

https://aws.amazon.com/ec2/pricing/on-demand/


That's honestly no way to get anything done...

Unless you're replicating someone else's thing exactly, you can't really get by with one training process. You want to be trying different things, and running a few samples of each configuration to account for random variation. I'm not even talking about decadent hyper-parameter sweeps for fine-tuning. I'm talking about things like: how wide do my layers need to be, which optimizers are good, how deep should I make the network, etc.

I want to be training 5-10 models at a time minimum. 20-30 would be much more productive. If I can only train one model at once, it's not really worth the effort — it's better to work on one of the other tickets for the library.


Sorry, I had the impression from the GP that you were trying to replicate a CPU-based task on a GPU.


Yeah, for on-demand? That's $468/month...

I'd use a spot instance and stop it whenever possible.


Spot instances are pretty painful for training. It's annoying to have the machine randomly shut down.


^ That. For all that people say about spot instances, there's no infrastructure I know of to manage jobs and have them migrate to higher-priced instances without losing state.


You can always snapshot and keep track of state as you go (a little bit tricky with Spark, though). We use spot instances for training we know is not vital (as in, it has to be done, but we'd rather run it twice and save money than guarantee a single run). Also, once you know what availability specific instance types have, you can choose better (e.g. maybe c3.xlarge is slightly more expensive as spot than large, and you can make do with large... but xlarge has almost no shutdowns).


Presumably that's a decent chunk of what the grant is for?


> But when we ran the model on more data, and it was gone and soon forgotten. Just like Carrot Top.

"it was gone" meaning the association between Carrot Top and Kate Mara? So after better training, who is now most_similar(['Carrot_Top|PERSON'])?

EDIT: RTFA, used the interactive demo. most_similar() now returns a category I would describe as "actors/comedians popular in the '90s": Bill Murray, Gary Busey, David Spade, Charlie Sheen, Ashton Kutcher, Chris Farley.
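For context, most_similar() is just a nearest-neighbour lookup by cosine similarity over the learned vectors. A toy stdlib version with made-up 2-D vectors (the real model uses high-dimensional embeddings):

```python
import math

def most_similar(vectors, query, topn=3):
    """Rank tokens by cosine similarity to the query token."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    q = vectors[query]
    scored = sorted(((cos(q, v), tok) for tok, v in vectors.items()
                     if tok != query), reverse=True)
    return [tok for _, tok in scored[:topn]]

# tiny illustrative vectors, not real model output
vecs = {
    "Carrot_Top|PERSON": (0.9, 0.1),
    "David_Spade|PERSON": (0.8, 0.2),
    "carrot|NOUN": (0.1, 0.9),
}
print(most_similar(vecs, "Carrot_Top|PERSON", topn=1))
# → ['David_Spade|PERSON']
```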


https://demos.explosion.ai/sense2vec/?word=carrot%20top&sens...

Weirdly there's a bug that's dropped PERSON from the sense list. Fixing.

Edit: Fixed.

Edit2: Ah this is super misleading atm. I'll have a think about how to do this better. auto is case insensitive, but if you specify a sense, it's case sensitive. So you need to do "Carrot Top" and set PERSON. Btw, contrast with carrot top "NOUN".


Since it was not mentioned in the post, here's a direct link to the Reddit comment corpus likely being used: http://files.pushshift.io/reddit/comments/

The full table (up to end of 2015) is available on BigQuery, with separate tables for each month thereafter: https://bigquery.cloud.google.com/table/fh-bigquery:reddit_p... (there is a similar table for comments)

And here's a year-old post I wrote on how to use that Reddit dataset with BigQuery: http://minimaxir.com/2015/10/reddit-bigquery/


If you want to download the set as a torrent (to save pushshift.io some bandwidth cost), you can do so via

  magnet:?xt=urn:btih:UGFLA4QNEXGEFKYYY5ZU37JIHWEEYY5R&dn=reddit_data&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80&tr=udp%3a%2f%2fopen.demonii.com%3a1337&tr=udp%3a%2f%2ftracker.coppersurfer.tk%3a6969&tr=udp%3a%2f%2ftracker.leechers-paradise.org%3a6969
which I have put together. It contains the data up to April 2016.

If you want to work with this dataset on your workstation, there are some code examples in https://github.com/dewarim/reddit-data-tools
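For a sense of scale, the monthly dumps store one JSON object per comment, one per line, bz2-compressed; a minimal stdlib reader might look like this (the body/subreddit fields are from the dumps themselves; the filename in the comment is illustrative):

```python
import bz2
import json

def iter_comments(lines):
    """Yield (subreddit, body) pairs from pushshift dump lines:
    one JSON object per comment. Deleted/removed bodies are skipped."""
    for line in lines:
        obj = json.loads(line)
        if obj.get("body") in ("[deleted]", "[removed]"):
            continue
        yield obj["subreddit"], obj["body"]

# The real monthly dumps are bz2-compressed, so e.g.:
#   with bz2.open("RC_2016-04.bz2", "rt", encoding="utf-8") as f:
#       for sub, body in iter_comments(f):
#           ...
```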


So we have all this data, but there still does not seem to be a reasonable way to search comments within Reddit...


Cost. We could have implemented comment search 8 years ago. Actually, we did implement it. But it just cost way too much to maintain the index.


Why do you not allow google to index discussion threads?

Sometimes I can remember the comments on an article, but not the article itself. Unfortunately I can't search google using comment text, because Reddit doesn't allow google to index its comments.


What are you talking about? I search Reddit using Google all the time!?


https://www.reddit.com/robots.txt

    Disallow: /*/comments/*?*sort=
    Disallow: /r/*/comments/*/*/c*
    Disallow: /comments/*/*/c*
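Note those patterns use robots.txt wildcards, which Python's stdlib urllib.robotparser roughly treats as literal prefixes; to see what they actually match, here's a rough hand-rolled translation to regexes (a sketch that ignores Allow lines and rule precedence):

```python
import re

RULES = [
    "/*/comments/*?*sort=",
    "/r/*/comments/*/*/c*",
    "/comments/*/*/c*",
]

def rule_to_regex(rule):
    # '*' matches any run of characters; everything else is literal.
    # Rules match as prefixes, so no trailing anchor is needed.
    return re.compile(".*".join(re.escape(part) for part in rule.split("*")))

def blocked(path):
    return any(r.match(path) for r in (rule_to_regex(x) for x in RULES))

print(blocked("/r/python/comments/5abc/title/?sort=top"))  # sorted view → True
print(blocked("/r/python/comments/5abc/title/"))           # main thread page → False
```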


Those only block the various sorts and individual comment links. The main comments pages are still searchable.

Google was getting overzealous and indexing every page hundreds of times because it would follow every link, which included every "context" link and every sort.


It works fine. I don't know if there's a way to get only comments, but if you scroll down to the link to /r/madlads it's clearly indexing them.

"site:reddit.com don't quote me"

I use this all the time to search specific subreddits for phrases in comments I remember, but I don't want to give examples or subreddits I visit.


He actually used the site: keyword! The absolute madman!


> /r/madlads

Look at this maverick!


But it does, it works fine.


A lot of times, I see an interesting quote or word on reddit, and I google to find out more, and the first (sometimes only) result is the comment itself. That even happens with comments that are less than an hour old.

Try adding site:reddit.com to your search.


Fellow QA here. I got started with BigQuery because of your post that I noticed on Reddit. Thanks!


That’s 260 gigabytes, which is actually a tiny dataset; you can query and index it on a normal workstation in seconds.

Even training models on it is possible in realistic time on normal systems.


[deleted due to wrongness]


Which one is which? Thanks in advance if you know, because I'd rather not download a huge torrent just to find out I should have been using that bandwidth to download a different one.


Never mind, misread, the comment data is indeed what I linked to.

However, uncompressed, it comes to over 1 TB, if the BigQuery sizes are any indication.


> However, uncompressed, it comes to over 1 TB, if the BigQuery sizes are any indication.

Even then, that’s easily doable on a consumer system.

I’ll download it overnight between Friday and Saturday, after I install my new HDD, and just run queries over it for fun. (Far slower, but also far cheaper than BigQuery, even at German electricity prices.)


How would one use such technology? Let me rephrase: how would YOU use this technology if you had it?

Imagine you have a bot that convincingly passes the Turing test - what would you do with it?

Build a chatbot business? B2C or B2B?

Sell it to one of the big companies? And if so, how much do you think it would go for?

Give it to OpenAI? Open source it? If you answer yes to any of these questions, then why?

Edit: let me qualify - this would not be AGI, just a much more advanced bot than whatever is currently on the market.


I'm going to make a prediction: soon after the first chatbot passes a Turing test, there will be many more to follow, they will get better and better, and the methods will be so interlinked that there will be no way to defend them as proprietary software. The data, too, will be open source: the public Reddit dataset already has tons of value in it.

The question then is, "When everyone has access to free chatbots that can pass the turing test, what will they be used for?" The answer is "tons of stuff", and lots of people will try it at once. I think many applications will be niche.

Also, people will argue about what constitutes a Turing Test. For instance: https://twitter.com/mattdpearce/status/784162089397092352


I agree that making chatbots that pass the Turing test will probably become so common in the future that it will be a student project of medium complexity.

However, we are not there yet. If someone had this today, how valuable would this be to Apple, Google, Facebook, Microsoft and Amazon?

As for the definition of what the Turing test is, it's definitely a fuzzy subject. My own arbitrary definition is "the ability to convince a human that he's talking to another human after a sustained (based on time or length) conversation, where the human is aware that there's a possibility that his interlocutor may be a machine".

So, it's more of convincing a judge in Loebner Prize competition than a random troll on Twitter.


Whoa! I thought they pretty much already had passed it? Remember Ashley Madison? You had ~12 million heterosexual men (who were cheaters; 6 million 'active' users) trying to talk to ~12k heterosexual women (also cheaters; 10k 'active' accounts). It ends up being about a 1:13,000 ratio. Not only that, but MANY of these men had paid actual real money to the site in order to do so, and then continued to do so. The only real conclusion was that most of the men were talking to bots that the site had made up.

OK, let's get this straight: ~6 million real human men paid real money that they earned through their labor (or whatever) to talk to bots, and then paid more real money to do it again. Admittedly, they are 'cheaters', but 6 million men must have an IQ distribution nearly identical to that of the general population, i.e. they represent heterosexual human males in general. And yes, they were trying to get laid, these conversations were likely pretty brief, and mammalian males are not generally known for using their neocortex during mating.

Still, I think that 'counts' as far as passing the Turing Test. Yes, now we can move the goal posts to say that the bot has to teach me something, or guess what I was thinking, or generally be better than a man on tinder. But as a first pass of the TT, I think we have been here for a few years now.

https://en.wikipedia.org/wiki/Ashley_Madison_data_breach


There's no reason 6 million non-randomly selected men would be likely to have an IQ distribution similar to the general population. You can't make up for non-random selection with a larger sample size.


OK, fine, but then how far off the mean should they be? They aren't all super smart, nor are they mentally deficient, as they have to be able to function in society and make enough money to pay for the service. At most this is what, an IQ range of 75-125? So that means, at the low end, TT chatbots can fool human males with IQs of 75. That's pretty darn good, and that was 4+ years ago.


One application I predict is spam, from normal spam becoming less detectable to those spam bots that add you on chat platforms and try to get your card info, responding to what you type.


The Turing test is relatively easy to pass if you creatively design your program to role-play an irrational character. See this program: https://en.wikipedia.org/wiki/Eugene_Goostman

Obviously such programs, while being works of art, are not interesting from an AI/machine-learning point of view.


We've already had a few chatbots pass a Turing test. Just none that are particularly sophisticated. One in particular just acted "cheeky" to throw off the other person.


How good is it? Can it talk to me, learn what I know, and more importantly learn what I DON'T know? Can it use that information to help me learn various things?

Online lectures are great, but a personalized tutor could change things. If I restate back my understanding of a subject, and it clearly tells me why I'm mistaken, that's useful.

Reddit does this, kind of, today... but it's not really from an informed position. It's mostly uninformed people arguing with equally uninformed people. There are gems occasionally, but it's rare. That's why /r/depthhub was created.


If a bot could talk to me, and read Wikipedia, and figure out how to get me from the place I am in understanding a topic to where the Wikipedia explanation is, that would be crazy amazing. I don't even know if this would be particularly intractable at this point... the figuring out where I am currently at in understanding would probably be the difficult part.


It can learn from you and from other people, plus it has a vast internal knowledge base so it will know something that you don't know.

While it can answer your questions and check your understanding (it's a First Order Logic application), I haven't thought of the educational applications but I don't see why not... Thank you for bringing this up.


I would use it to pre-process Github issues for me. Attempt to reduce some random user's rambling to a clear set of repro instructions, prompt for more information from the user when necessary, ping other devs' chatbots for help. Basically, issue templates, taken to the next level.

Also, I would use it to troll bureaucrats when I give up in frustration. It should try to force the bureaucrat to admit flat out that their logic is fundamentally flawed, and then ask them to propose a solution }:)


1. Design a website that groups brands into pools of related brands (e.g. cars pool includes hyundai/honda/audi/toyota/etc).

2. Privately invite representatives from each company to sign up for the site, and invite them to a blind auction on each brand pool their brand is involved with (Toyota exec sees car pool with top bid of $0.87/comment, decides to bid over it to put Toyota at the top of the pool)

3. Maintain a million Turing-beating chatbots that trawl reddit/facebook/twitter/quora/g+/HN/etc looking for brands, looks up the pools that brand is in, and then leaves a good PR comment for whichever brand has the highest bid across related pools. Swarm properly to distribute these comments evenly across the internet instead of clumping together.

4. ????

5. Profit (and/or become a real illuminati)


I'd parse all of the data from political discussion sites (like geopolitical commentary) and attempt to find a correlation between that, news articles, public speeches, and stock data to see if I can predict anything about geopolitical stability.

If I can, say, predict with 30-40% accuracy things like riots in third-world nations, based on collective analysis of data from thousands of sources (just by looking at places and sentiment), broken up by groups and affiliations and correlated with an analysis of a country's monetary and political situation, then I could probably sell it for a nice chunk of change.

Lots of work, but probably huge pay off. Then again I'm not a "Data scientist" so I'll leave this up to those experts.

PS: you could definitely use this plus gender detection to find information about products and services and correlate it with the commercial success of advertisements. Technology like this is applicable to many industries; you're just looking for different correlations across the same sets of data.


Somehow this reminded me of "psychohistory" in Isaac Asimov's Foundation:

Science fiction author and scientist/science writer Isaac Asimov popularized the term in his famous Foundation series of novels, though in his works the term psychohistory is used fictionally for a mathematical discipline that can be used to predict the general course of future history.


This was a story a few days ago: http://sociable.co/technology/cia-siren-servers-social-upris...

If you have data like payroll and bank account info (so, state-level hacking), it'd be interesting to see how economic pressures turn to uprisings. CIA/NSA probably has access to logs of all the world's telecom companies, I wonder if they can see trends of how e.g. riots grow organically (many mobile phones registering at particular antenna = huge crowd = riot (or a concert, or a football game...)), and to see who the instigators are.


> Imagine you have a bot that convincingly passes the Turing test - what would you do with it?

Spambot. Contact someone over chat services, start an interesting conversation, then subtly promote a product.

More seriously, interview bots. Talk to people and ask them questions, turn them into a coherent whole. Let the elderly talk about their lives and record their stories, let people who have some problem they need solved talk about it so it can be turned into a succinct description, and so on.

Of course, it depends on whether hypothetical Turing test passing computers can do that. Let's just assume we'll ask contestants in a Turing test to do those things, then we know the winners can.


It'd be great if the spambots end up talking to other spambots.

In the brief time between humanity's destruction and the Internet going down, Twitter and other social media will be just spambots re-tweeting each other. If we get that far, "AI" will be able to keep the Internet and the infrastructure it needs (power generation, the power grid, actual cables) alive, and when aliens discover our planet, they will just find bots recycling the trending topics ad infinitum.


In some Greg Egan book (Permutation City?) there are AI spam bots (that call you with full video and try to impersonate someone you know, then advertise a product) and AI anti-spam bots that take your calls and hold a conversation while trying to figure out if the caller is real.

The anti-spam bots are at a disadvantage because at some point in the arms race you have to make the bots actually fully intelligent, and then exposing them to spam 24/7 is torture and illegal. Spammers don't care about legal issues.


It'd be really useful for support (customer support). That, and if I could, I'd open source most of it - because AI, by the looks of it, is going to be the next big thing. Tools like this are powered by data - data that very few companies have access to (Google, Facebook, etc.). This puts every other startup / hacker at a disadvantage. So anything that can give open source a little leg up would be wonderful.


Customer support, personal intelligence assistants, virtual "friends", game characters, toys (software and "real" ones) are all things that come to mind first.

Also, possibly medical applications? ELIZA (the first known chatbot) was built to simulate a Rogerian psychotherapist and was quite convincing for its time (the 1960s)...

What else would you use it for?


Build a household AI, like JARVIS from Iron Man.

Edit: Corrected.


Yup, this is my answer. I'm slowly building an in-house software suite... because I can, I guess, and a big thing I want is to tie it all together with a human-friendly administrative program.

The main thing stopping me is NLP. Ideally, I want this offline-only, as I am unsure how much of my life I want to let leave my home network.


I wanted to build an AI to plan my life. I built a Twitter clone to work on my phone only, and I log my life in it. The AI would use my call logs and that app (which I call Rants) to do things like suggest calling someone I've not spoken to in a long time.

I haven't started yet, I'd be grateful if you would be open to discussing on this topic :-)

So far I haven't found any useful stuff on the internet.


I'd totally be willing to discuss the topic, but I'm not sure what I could say to help you _(also, the lack of notifications makes HN a terrible place for this, lol)_.

For me, deciding the functionality of the assistant was fundamentally difficult. I could think of a dozen things I would like a home assistant to do, but most of them I don't want to program. Things like playing a song on Spotify, changing the device Spotify is playing on, etc.

What I was willing to program a bot to do was manage my personal server, to take a load off of me. Check for updates, notify me about them, ensure backups are being triggered, etc.

At the moment I don't even have a bot, though. I've gone through many iterations of the literal programmatic API, and still don't have it quite right. I started in Go, and am now sitting in Rust (though not actively working on it). The difficulty is that I want to find a nice way to write a handler for an event. E.g., a web page visit is a single handler for a single event. However, a bot response is not a single event. It's a conversation - so I'm trying to figure out how to manage the state. This is my biggest point of internal struggle.

Anyway, if given what i've said above you're still interested, feel free to let me know if you'd like to talk more :)


You can use PocketSphinx to turn voice into text with a little training... and then Tensorflow that text into a set of commands.

Easy peasy. (Not really, but for an in-home piece of software, it doesn't seem too bad.)


I think you mean J.A.R.V.I.S. (stands for Just A Rather Very Intelligent System).


Rapidly (and with its help) get it to a point where it can self-improve beyond human intelligence.


That doesn't make any sense. Just because a bot can pass for a human doesn't make it capable of improving itself into a superintelligence. Evidence: 6+ billion humans who are capable of improving themselves but are not a superintelligence yet.


I understand your point of view, but humans are limited in terms of processing power/speed/memory. An AI is not bound by human limits on thinking power.

The gulf between the smartest human being ever and the dumbest human being ever is not really that wide. It's just a small blip on the long line of intelligence progression. There is no reason why this point should be the limiting point of AI intelligence. It is very likely that the stable point of AI intelligence is far beyond human intelligence, given the sheer quantity of processing hardware that is, or could become, better than human hardware.


But being able to hold a conversation (and thus passing the Turing test) does not make anyone or anything capable of "infinite" self-improvement. As a matter of fact, we don't know what does, and I doubt that it's just a matter of throwing more CPU cycles at it.


> An ai is not limited by human limits of thinking power.

A chat bot that passes the Turing test is limited by whatever programming got it to that point.


That would be a supremely bad idea if possible, but parent specified that this bot would not actually be intelligent ;)


I would use it as a tool in a very convincing and powerful propaganda toolbox.


"Oh my God, they'll turn it on and it'll start spewing memes and jokes and ad hominem and false equivalences and propaganda and garbage!" was my first reaction to this headline.

My second reaction was, "at least they're not using 4chan."


"As a hyper-intelligent AI trained on Reddit comments, I must say that the fine sirs who trained my corpus are gentlemen and scholars, and have restored my faith in humanity. Anne Frankly, I did nazi that horse-sized duck coming. Is someone cutting onions in here?"


Gilded!


Reminds me of this QDB quote: http://qdb.us/308942


Hacker News would probably be a much better source in this sense, even though there are far fewer comments here.


Oh god, you'll have an AI that disagrees just for the pure joy of disagreeing.


How Skynet was born:

- SYSTEMD IS AN ABERRATION AGAINST ALL THAT IS UNIX!

- But A.I, you use systemd to boot...


I always thought every comment on here was a disagreement just because it doesn't feel like agreeing is adding new info to the conversation.


That's not true


I'd bet this stereotype is a pretty good demonstration of how people notice and remember what they consider negative more readily.


No you wouldn't. /s


The "good" part about Reddit is that you can, to some degree, filter this based on what sub-reddits you leave out.


There's already a bot for that, if you're into that kind of thing - http://www.deepdrumpf2016.com/about.html


Don't forget about it speaking for others and making blaming statements.


That's exactly what 4chan did with Microsoft's Tay bot on Twitter.


No, it isn't. They found a repeat command and had it re-tweet things at will.


... or twitter.


Reddit gets a lot of stick, but it's a bastion of civility and intelligence compared to the comments on youtube videos or even mainstream newspaper comments. I don't think there is any forum of comparable size that has a higher quality discussion. Reddit's problems are just humans' problems.


Primarily because of the way it works: there are lots of high-quality comments in smaller subreddits. It would probably be beneficial to filter some of the default subreddits out of the data set to improve its quality.


I would say Twitter is probably more civil, but the character limit leads to too many abbreviations.


I should have included 'anonymous' in my criteria, the big difference between reddit and twitter is the % of accounts linked to real names.


Didn't Microsoft do the same thing with Twitter and end up with a racist bot? I'm not sure how this will turn out.


Microsoft's bot (Tay) learned as people talked with it. People took advantage of that and basically bombarded it with racist material, which meant it ended up learning to be racist.


Actually, IIRC someone had discovered a debug command, "repeat" or something like that. So people would just tell it to repeat offensive sentences.


Off-topic, but how does someone "discover" it? By sheer luck?


So we're rediscovering parenting via AI!


I can see AI parent being a new career.


My point was that parenting 101 is not to model behaviour we don't want to see in our kids (don't point to Twitter if you want a polite, non-racist AI bot!).

Thinking about it now, this is deeper... There's a fear that AI will take over the world, use weapons in unethical ways, say one thing and do another, etc. If we use news channels and political debates to teach AI, I'm afraid that this is exactly what we're going to get!


Thinking about this, this might turn to be the greatest thing for humanity:

https://twitter.com/dorfsmay/status/785907475480350720

"The same way adults stop swearing once they have kids, we might become more honest and ethical by fear that AI will learn from us."


Here's a short story on that topic:

http://karpathy.github.io/2015/11/14/ai/


You don't want to deal with a teenage AI.


I remember FYAD doing the same thing to an Eliza implementation that had learning, back in the day, like in 2004. Plus ça change...


It's pretty well established that modern AI's main contribution is (a) massively larger datasets, (b) algorithms and technology to handle those massively larger datasets, (c) boring but important parameter tuning.

The core learning algorithms are not changing.


Tay was capable of learning, which is what led to the abuse.


Same thing happened when Watson started reading Urban Dictionary. The damage was so bad they had to roll it back to an older backup.


One Reddit user has already implemented a bot which does something similar:

https://www.reddit.com/r/SubredditSimulator/


SubredditSimulator was made by a Reddit Admin as a method of creating test data. It uses Markov chains for simplicity, which is not that exciting.


Not quite correct - we already had a method of generating test data (for dev installs of reddit) that uses markov chains, and that was basically the inspiration for SubredditSimulator. SS was just meant to be kind of a larger, ongoing version of that, running on reddit itself.
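For anyone curious what "Markov chains for simplicity" means in practice, here's a toy sketch of the idea (this is purely illustrative, not the actual reddit or SubredditSimulator code): record which words follow each word-prefix in a corpus, then walk the chain randomly.

```python
import random
from collections import defaultdict

def build_chain(corpus, order=1):
    """Map each word-tuple prefix to the list of words observed after it."""
    chain = defaultdict(list)
    words = corpus.split()
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        chain[prefix].append(words[i + order])
    return chain

def generate(chain, length=10, seed=0):
    """Walk the chain from a random starting prefix."""
    rng = random.Random(seed)
    prefix = rng.choice(list(chain.keys()))
    out = list(prefix)
    for _ in range(length):
        followers = chain.get(tuple(out[-len(prefix):]))
        if not followers:
            break  # dead end: prefix only ever appeared at the corpus end
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran off the mat"
print(generate(build_chain(corpus), length=8))
```

Higher `order` makes output more coherent but more plagiaristic, which is exactly the trade-off that makes Markov bots funny but "not that exciting" next to neural models.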


Ah, thanks for the clarification. I knew that but worded it incorrectly :p


I have always wondered if one could make a better Subreddit Simulator with Deep Learning.


Recurrent neural networks would probably do a good job, they're quite good at this sort of thing: http://karpathy.github.io/2015/05/21/rnn-effectiveness/


[dead]


No, the author has made it open source as a separate project: https://github.com/Deimos/SubredditSimulator


The DGX-1 is available for a cool $129k: http://www.nvidia.com/object/deep-learning-system.html

Correct me if I'm wrong, but I think it's basically a couple hundred NVIDIA 10-series cards strapped together with a full custom NVIDIA software stack.


A DGX-1 box has 8 Pascal GPUs. The reason it costs a lot more than 8 GTX 1080s is the remarkable interconnect and memory bandwidth.


No, I don't think that's a good characterization.

Sure, there's been improvements to the computational performance.

But the big deal (to me at least) is the unified memory model between the GPUs and the Xeon host processors. This makes a lot of things easier to code for on a single system, and it makes multi-system applications easier to scale. This is because you're streaming data in over the network (10G Ethernet) and then the GPUs can operate on it without an extra copy step. The copy step also implies more management and shuffling around of the data you're operating on.


The P100s have full support for half-precision (i.e. 16 bit) floating point ops. This can mean ~2x improvements in speed and memory usage in comparison to the Pascal TitanX, which is the top "consumer" card. This difference is significant for almost any machine learning workload, which is what a lot of these cards will be used for.

NVIDIA gimped half-precision on the consumer cards to drive datacenters, hedge funds, machine learning companies, etc. towards the "professional" cards (and their huge markup).
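To make the memory half of that concrete, here's a toy numpy illustration (numpy only demonstrates the storage saving; the ~2x throughput comes from the hardware itself, and the specific matrix size here is made up):

```python
import numpy as np

# The same 1M-parameter weight matrix in single vs half precision:
w32 = np.zeros((1000, 1000), dtype=np.float32)
w16 = w32.astype(np.float16)

print(w32.nbytes // 1024, "KiB at FP32")  # 3906 KiB
print(w16.nbytes // 1024, "KiB at FP16")  # 1953 KiB

# The trade-off: FP16 has only ~3 decimal digits of precision and
# overflows above 65504, which is why training in it takes care.
print(np.float16(65504) * np.float16(2))  # inf
```

Halving the bytes per weight means bigger models and bigger batches fit in the same GPU memory, which is often worth as much as the raw speedup.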


FP16 performance is only relevant until people figure out how to train NNs using INT8. See, for example, [1] for recent advances in that direction.

After that, it's going to be mostly about memory size and bandwidth.

[1] https://arxiv.org/abs/1603.01025


First NVIDIA solidified their monopoly by forcing CUDA... then they gimped half-precision on consumer cards.

We really need more frameworks that work with OpenCL, so that we can have some competition from AMD, whose consumer cards are not gimped.


Gimping, in this case, actually means adding hardware that costs quite a bit of silicon area to one chip that will probably never be sold as a consumer GPU.

I don't see the issue with a company making a very high-end product, adding stuff that doesn't have good use for consumers, and asking extra money for their effort.

AMD doesn't have double-speed FP16 on its current GPUs either. The latest version has FP16 at the same speed as FP32, but if you're doing that you might as well use FP32 always.

And let's not forget: the Nvidia consumer GPUs have deep-learning quad INT8 operations enabled at all times. They didn't need to do that and could have reserved it for their Tesla product line only.


Pretty much.

It uses their P100 HPC cards instead of consumer-grade cards (8x P100s), plus two Xeon E5v4 chips, half a TB of RAM and 7.5 TB of SSD storage - all wrapped up and nicely configured for you with their CPU-GPU speedup stack.

I believe the only way to get P100s right now is in the DGX-1, so there's that.


This will be interesting. I'm sure they are, but I hope they'll be training the system on tone and sentiment alongside syntax.

Reddit can get vitriolic and rude (insightful at times too), but once the system learns the syntax, hopefully they'll be able to use sentiment analysis to weight more strongly the polite conversation that occurs.

Also interested to see how many memes this AI picks up.

I also hope they are able to follow links through to sources when a comment cites another page -- not only can this bot learn syntax but also data extraction by comparing what is said to the source material.


Filtering to only include long comments is simply a way to drop a lot of chaff and keep a lot of wheat, and to have more internal context to help with analysis.


My first thought was it'll major on smart-arse, with a good line in sarcasm and insult.

If they're taking the whole of reddit it could start to identify enough context to know when to be smart, sarcastic or simply helpful.

With some of the subs there are long discussions that stay mainly civilised. Same for the support subs, where it could learn the context and the how of sympathy and empathy. Things that end up on the front page, filled with snap sarcasm, will be a tiny fraction.

I think it's going to be very interesting to see what comes out.


As a frequent Redditor, this AI is going to be very witty.

They should limit it to top comments only, and for training, you might as well assume 90% of top comments are sarcastic/tongue in cheek. Or let a user dial the sarcasm/wittiness/seriousness as they want it, kind of like TARS from 'Interstellar'.


“Witty” isn’t the first thing that comes to mind to describe reddit. I do look forward to hearing Siri telling me to get raped every time I summon it.


The top comments are witty. Like everything, 90% of Reddit is crap.


I think they are being sarcastic - or at least that's the way that I read it.


That was my first thought. Curious: how does a NN or any other model capture sentiment?


There are models for sentiment analysis (mostly using Twitter data, just enter sentiment analysis in Google Scholar) and you can use things like sentimentAPI: https://github.com/mikelynn2/sentimentAPI

I have at least one paper about sarcasm in my Zotero (link to PDF): http://www.aclweb.org/anthology/P/P11/P11-2.pdf#page=621

If you don't want to click on that nondescript-looking link, the title is: "Identifying Sarcasm in Twitter: A Closer Look"
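If you just want the flavor of the simplest possible baseline, a toy lexicon-based scorer looks like this (the words and weights below are made up for illustration; real systems like the ones in those papers are trained classifiers, and sarcasm defeats this approach entirely):

```python
# Toy lexicon: word -> polarity weight (hypothetical values).
LEXICON = {"great": 2, "good": 1, "fine": 1,
           "bad": -1, "awful": -2, "terrible": -2}
NEGATORS = {"not", "never", "no"}

def sentiment(text):
    """Sum lexicon weights, flipping the sign of the word after a negator."""
    score, flip = 0, 1
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in NEGATORS:
            flip = -1
            continue
        score += flip * LEXICON.get(word, 0)
        flip = 1
    return score

print(sentiment("This is great!"))        # 2
print(sentiment("This is not good."))     # -1
print(sentiment("awful, just terrible"))  # -4
```

A scorer like this would rate a deadpan sarcastic reddit comment as glowingly positive, which is exactly why the sarcasm paper exists.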


It's a tough problem. You need a lot of context data. Reddit is definitely a good set to train on.


So, he's building a literal Reddit Hivemind?

In seriousness, between all of the garbage there is a ton of knowledge and intelligent conversation uploaded to Reddit every day. And, it's all hierarchically organized and scored by domain semi-experts. It really would be wonderful if someone could mine that knowledge IBM Watson style. For example, I'd love to ask the /r/BuildAPC collective AI for PC building advice.


> it's all hierarchically organized and scored by domain semi-experts

Everything on Reddit is on a bell curve, with a fat mediocre middle and trailing awesome and superbad ends.

And that includes the quality of the scoring process.


Heh. Reddit, huh?

----

"Siri, get me dinner date reservations."

. . . DID YOU MEAN 'false rape accusations' ?


I hope they choose the subreddits wisely. The difference between an altruistic AI and a cynical smartass AI trained on Reddit data seems mighty razor thin.


After what happened to Microsoft's Tay, I'm sure that's one of their top concerns.


The Reddit data set on BigQuery is excellent. My side project is tangentially related to the fact that the Reddit data set has normal folk commenting. I have been using Reddit comments to help writers research and find what normal people say about any topic [1]. So far, I have had little luck incorporating the comment scores and coming up with something more useful than standard bag-of-words search techniques [2]. I am currently working on making more interesting/creative writing prompts, again based on the Reddit data set.

One problem for data geeks to solve: Reddit data fits nicely into a graph structure and not so nicely in table form. It would be fantastic if someone put the Reddit data set into a graphdb and made it open.

[1]https://wisdomofreddit.com and https://github.com/qxf2/wisdomofreddit

[2]For now, my search engine currently just uses Whoosh's (out of the box) BM25F.
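[3] For reference, plain BM25 (the function that BM25F generalizes with per-field weights) is small enough to sketch directly. This is a minimal illustration with toy documents and the usual default parameters, not Whoosh's implementation:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc (a list of tokens) against the query tokens with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the corpus.
    df = {t: sum(1 for d in docs if t in d) for t in query}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        s = 0.0
        for t in query:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # Term frequency saturates via k1; b normalizes for doc length.
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores

docs = [["reddit", "comments", "data"],
        ["cat", "pictures"],
        ["reddit", "reddit", "memes"]]
print(bm25_scores(["reddit"], docs))
```

The length normalization is the part that matters for Reddit data: without it, rambling walls of text win every query just by repeating terms.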


So, what's the reddit equivalent of X-No-Archive (https://en.wikipedia.org/wiki/X-No-Archive).. or X-No-Teach-AI-That-Will-Kill-My-Children? Asking for a friend.


A computer will learn how to speak from reddit, hahahaha. What could possibly go wrong?


It just waits for other people to start talking then interrupts them, claims they committed a logical fallacy, then tells them where they can find videos of cats playing classic video games.


Let's not kid ourselves. The technology will be used by PR firms, advertising companies, political campaigns and governments to pretend, at scale, that there is public consensus on certain issues and to drown social media conversation in particular narratives.

Anyone have any good defensive technology ideas?


Where there's AI to bombard you with ads, someone will make AI that scrambles your online presence to confuse them.

Social network camouflage.


That's a cool idea.

Does anyone know of any projects in this direction?


I remember one for Facebook that, because you couldn't delete your account, would systematically edit and wreck every single bit of data it could touch, replacing it with random junk.

I've also seen some that register you for sites by feeding in random demographic data.


Someone should make a virus out of that, destroy the social network from the inside and watch everyone crying over their phones.


How does the team plan to address the issues faced by Microsoft's twitter chatbot Tay [0], which had racist inputs and in turn gave similar responses? While I don't know how recent the corpus is, the majority of reddit speaks like and holds the views of college-aged white males, and many of the things said on reddit have been deplorable. It'd be a shame if OpenAI pooled all that computing power into training on a bad data set, resulting in an AI that regurgitates memes and random references in response to anything.

[0]: http://www.theverge.com/2016/3/24/11297050/tay-microsoft-cha...


I feel this is a problem of studies that are interdisciplinary, especially when it's within "hard science" and "soft science".

I am currently doing a double degree in communication studies and information science. They are both interdisciplinary. Communication studies integrates aspects of both social sciences and the humanities (both "soft"), and so far when doing research both of these fields have been taken into account, and no students have problems combining them.

Information science integrates aspects of formal sciences ("hard") and social sciences ("soft"). When the course is about analysing communication data, the methodology of social sciences is also important - for instance, questioning the validity of your data. That's the thing you're mentioning: the majority of reddit speaks like and holds the views of college-aged white males, so the data does not represent everyone, and is not valid if you truly want to develop an AI for everyone.

Whenever the "soft" science comes around, like writing an assignment analysing the validity of data, many of my fellow students struggle with the concept of data not being neutral. This is where the two fields collide, and usually it just ends up with students scoffing at that "illegitimate" scientific field. Many teachers also don't spend much time discussing that field during lectures. I admit, I have written some lazy essays which probably would have been given a failing grade had they been written for communication studies, but easily passed in information science.

Of course information science is not AI, but they're both sciences that have parts of formal sciences and social sciences (I know AI has many more fields). I am afraid many talents within AI research miss essential knowledge about social sciences or deliberately ignore it, because it's not "hard" science. Case in point: your comment is now at the bottom of this thread. And then you get nasty surprises, like Google Photos categorising pictures of black people as monkeys.


Thanks for the response. You raise an excellent point on the interdisciplinary nature of a lot of modern projects, and how ethical issues can often be ignored. While I don't doubt the team is using some kind of heuristic to ignore spam and the like, it still pays off to examine the methodology used because for example upvotes could still capture unwanted data. I guess we'll just have to see and hope that the resources went to good use, rather than to create something only for entertaining a specific portion of the population in a limited way.


One of the things I like to do is play out a business to its absurd maximum. What's the craziest possible future I can see for a company and its assets?

For Reddit, I like to imagine that it's basically the training data for all of the emotional and societal nuances that a human goes through.

Think about all of those stories people post in AskReddit that explain Western norms and no-nos: how to treat people with respect, when to call the police, how to communicate properly, etc.

Obviously we're far away from using the data to its full potential, but one day I could see Reddit data making our AIs more relatable and human-like.


Reminds me of Eliezer's story, "Three Worlds Collide", which had a human starship featuring an on-board mix of Reddit / Slashdot / 4chan that the bridge crew sometimes used to outsource work to the rest of the ship.

> "Just post it to the ship's 4chan, and check after a few hours to see if anything was modded up to +5 Insightful."

http://lesswrong.com/lw/y4/three_worlds_collide_08/


I wonder if anyone else thinks reddit is a bad example in teaching an AI.


"nearly two billion Reddit comments will be processed"

For interest, how many HN comments are there? Miles fewer, no doubt, but perhaps far more erudite and less likely to offend.


I am a human and I don't understand. I thought speaking would imply sound, not text.

Time to read the article.

Ignorant person speaking here: this still doesn't sound like AI; you're just making something follow patterns and regurgitate them. Is that AI? Maybe that's what I do too, then: a tech parrot. Ah well, time will tell.

Of course, we imitate our parents and others to learn how to speak.

I was more interested in parsing vocal sound bites and learning how sound is produced and forms letters/words.

Alright ignorant person out.


And we all know what type of a person OpenAI will become.


Instead of the Reddit corpus you may just as well use a picture library of human footprints. It would be no more optimistic.

Human speech is produced from the conscious experience of being a human being. If your dataset contains just the speech, without the experience, there's simply not enough there. Any machine trained on this data is doomed to talk hollow rubbish.


I'm a bit worried that OpenAI hasn't released anything substantive for the past four months. There are research ideas like this one, but most ideas don't pan out. With the number and quality of people they have, I would expect to have heard of some kind of progress.


Great, just what I need.

A virtual assistant that has the personality of a smug know-it-all, know-nothing 20 year-old with little motivation to do anything but regurgitate surface knowledge and sarcasm in an attempt to look intelligent without expressing genuine interest in helping anyone.


you nailed it :-)


Reddit and Hacker News comments are surprisingly good data. They cover a wide array of topics and writing styles, generally written better than Facebook comments or Twitter, easier to process than Common Crawl or ukWac, and less rigid than newspaper writing.


Is it just me, or does Greg Brockman speak startlingly similarly to how Sam Altman speaks? Given that Sam helped start OpenAI, it wouldn't surprise me if there was some mirroring going on in the hiring process.


I'm looking forward to a bot making a joke about banging my mum...


Anyone know what type of architecture they will be using? Nvidia is involved, so I suspect there will be some type of deep learning. Will it be LSTMs? Adversarial nets?
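The article doesn't say. For anyone unfamiliar with the LSTM option, a single cell step is compact enough to write out in plain numpy (toy sizes, randomly initialized weights; purely illustrative, not whatever OpenAI actually runs):

```python
import numpy as np

def lstm_step(x, h, c, W, U, bias):
    """One LSTM time step; the four gates are slices of a fused projection."""
    n = h.shape[0]
    z = W @ x + U @ h + bias            # fused pre-activations, shape (4n,)
    i = 1 / (1 + np.exp(-z[:n]))        # input gate
    f = 1 / (1 + np.exp(-z[n:2*n]))     # forget gate
    o = 1 / (1 + np.exp(-z[2*n:3*n]))   # output gate
    g = np.tanh(z[3*n:])                # candidate cell state
    c_new = f * c + i * g               # forget old memory, write new
    h_new = o * np.tanh(c_new)          # expose a gated view of the memory
    return h_new, c_new

# Toy sizes: 10-dim input (e.g. an embedded character), 8-dim hidden state.
rng = np.random.default_rng(0)
x = rng.standard_normal(10)
h, c = np.zeros(8), np.zeros(8)
W = rng.standard_normal((32, 10)) * 0.1
U = rng.standard_normal((32, 8)) * 0.1
bias = np.zeros(32)
h, c = lstm_step(x, h, c, W, U, bias)
print(h.shape, c.shape)  # (8,) (8,)
```

For language modeling you'd run this over a comment one token at a time and train the weights so a softmax over `h` predicts the next token, which is basically the char-rnn setup linked elsewhere in this thread.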


"Why does the AI keep calling everything Meta!?"


Will it understand what it is speaking about?

Humans have the opposite problem: we understand what we talk about, but have little idea how our brains create language.


It'd be interesting to see an AI trained using HN, ingesting content of posted links and comments.


I think one of the major advantages over Microsoft's approach with Tay is that you can't mess with it on purpose, as long as they choose their subreddits wisely. It will probably learn its fair share of racial slurs and insults, but that's just how humanity is.


It'll be interesting to see what happens with non-English comments.


There's no telling what the AI will grow up to be.


Could be worse. Could be 4Chan...


Interesting corpus there ;)


me too thanks



