Back in 2007, mobile phones used a system called T9 from Nuance corp which was trained on a word corpus taken from IRC and similar chats.
This caused all kinds of issues - the mobile phones would accept offensive words like "naziparking" but reject normal language like "world peace".
Using reddit may lead to ... surprises.
My impression was that this happened because they were using an algorithmic compression engine that favored false positives over false negatives. Essentially, they had trained an algorithm to generate "strings of characters that look like words" as a means of compression, and some weird strings of characters were judged by the algorithm to be words even though they shouldn't have been.
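If that theory is right, the failure mode is easy to reproduce with a toy character-trigram model: any string assembled entirely from previously seen trigrams gets accepted as a "word", even when it isn't one. A rough sketch (the training words and the acceptance rule are invented for illustration, not T9's actual algorithm):

```python
from collections import defaultdict

def train_trigrams(words):
    """Count character trigrams (with boundary markers) over a word list."""
    counts = defaultdict(int)
    for w in words:
        padded = "^" + w + "$"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts

def looks_like_word(candidate, counts):
    """Accept a string if every one of its trigrams was seen in training."""
    padded = "^" + candidate + "$"
    return all(counts[padded[i:i + 3]] > 0 for i in range(len(padded) - 2))

model = train_trigrams(["parking", "kingdom"])
print(looks_like_word("parkingdom", model))  # True: a nonsense blend, but every trigram was seen
print(looks_like_word("peace", model))       # False: contains trigrams never seen in training
```

A compact model like this accepts weird compounds while rejecting perfectly normal words it never saw, which is exactly the "naziparking" vs. "world peace" behaviour described above.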
Might have been an exclusively German language problem, or maybe they cleaned it up since that article was published. I have a phone with T9 and it works remarkably well. Heck, I would prefer it to Swype. (Of course, it's not useful on smartphones, because they no longer have real keys.)
t9 was AMAZING and i miss it every day. faster and more accurate than swype for sure, especially since inputs for t9 are deterministic.
very easy to text even from your pocket etc - something you can't do on a smartphone - notice texting accidents became more common with the widespread adoption of smartphones ;)
Note that you're comparing a touchscreen to tactile keys in that statement too, though. I could text from my pocket without T9, but try doing that with a touchscreen WITH T9...
And while we're doing that touch vs. tactile comparison, what I also loved about the old "dumb phones" was deterministic timing. I could do stuff on those phones from start to end without looking at the screen - I quickly learned and remembered that e.g. this operation completed instantly, and that menu has ~1/4 of loading time, etc.
I think this feature was actually key to being able to memorize how to perform operations quickly. There's no way you can do that with modern smartphone, where every other interaction lags for anything between 10 and 1000 ms, and inputs are sometimes dropped at random. It's this non-determinism of smartphone UI that makes me look at the screen all the time when using it.
I think this is actually because of the changing interface. It doesn't matter how long the interface takes to be ready if the keystrokes are buffered (which they were), so you could press "menu down down down ok" and the thing would work, even if the menu took two seconds to show up.
You can't do that when you have to actually wait for the thing you want to select to show, otherwise you'll be selecting something completely different.
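The difference is easy to model: in the buffered world, "down down down ok" always lands on the same item no matter how slow the menu is, while in the laggy world, keys that arrive before the UI is ready are silently dropped and the same sequence selects something else. A toy sketch (menu and timing invented):

```python
def run_buffered(menu, keys):
    """Old-phone model: keystrokes are queued and replayed once the menu is
    ready, so the outcome is independent of load time."""
    cursor = 0
    for k in keys:
        if k == "down":
            cursor = min(cursor + 1, len(menu) - 1)
        elif k == "ok":
            return menu[cursor]
    return None

def run_laggy(menu, keys, ready_after):
    """Touchscreen model: inputs arriving before the UI is ready are dropped."""
    cursor = 0
    for t, k in enumerate(keys):
        if t < ready_after:
            continue  # input silently dropped while the UI is still loading
        if k == "down":
            cursor = min(cursor + 1, len(menu) - 1)
        elif k == "ok":
            return menu[cursor]
    return None

menu = ["Messages", "Contacts", "Settings", "Alarm"]
print(run_buffered(menu, ["down", "down", "down", "ok"]))               # Alarm
print(run_laggy(menu, ["down", "down", "down", "ok"], ready_after=2))   # Contacts
```

Same key sequence, different selection: that non-determinism is why you end up having to watch the screen.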
I am really hoping that we reach peak touchscreen soon, and companies can start exploring other form factors again. I want a fast CPU and wifi in something that isn't just a giant flat slab of glass.
Texting also became more common with the widespread adoption of smartphones - we'd of course see more accidents as a result, but the medium became massively more useful to a lot more people.
The Reddit comment corpus is an awesome dataset. There's relatively little mark-up to scrub out, low duplication, good metadata, and a variety of topics.
Btw, the above was run on CPU in a couple of days, because spaCy doesn't use GPUs yet. I've applied for a grant from NVidia so I can fix that. If anyone from NVidia is reading, email me? :)
Unless you're replicating someone else's work exactly, you can't really get by with one training run. You want to be trying different things, and running a few samples of each configuration to account for random variation. I'm not even talking about decadent hyper-parameter sweeps to fine-tune. I'm talking about basics: how wide do I need my layers to be, which optimizers are good, how deep should I make the network, etc.
I want to be training 5-10 models at a time minimum. 20-30 would be much more productive. If I can only train one model at once, it's not really worth the effort — it's better to work on one of the other tickets for the library.
^ That. For all that people say about spot instances, there's no infrastructure I know of to manage jobs and have them migrate to higher-priced instances without losing state.
You can always snapshot and keep track of state as you go (a little bit tricky with Spark, though). We use spot instances for training we know is not vital (as in, it has to be done, but we'd rather run it twice and save money than pay to guarantee a single run). Also, once you know what availability specific instance types have, you can choose better (e.g. maybe c3.xlarge is slightly more expensive as spot than large and you could make do with large... but xlarge has almost no shutdowns).
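The snapshot part is mostly discipline: checkpoint atomically after every unit of work, and resume from the last checkpoint on start-up, so a spot termination costs at most one unit. A minimal sketch (the file name and state layout are invented; a real setup would also checkpoint model weights, optimizer state, etc.):

```python
import os
import pickle

CKPT = "train_state.pkl"  # hypothetical checkpoint path

def load_state():
    """Resume from the last checkpoint, or start fresh if none exists."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0, "weights": None}

def save_state(state):
    """Write to a temp file, then atomically swap it in, so a shutdown
    mid-write can't leave a corrupt checkpoint behind."""
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

state = load_state()
for epoch in range(state["epoch"], 10):
    # ... one epoch of training would go here ...
    state["epoch"] = epoch + 1
    save_state(state)  # after this, a spot termination loses at most one epoch
```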
> But when we ran the model on more data, and it was gone and soon forgotten. Just like Carrot Top.
"it was gone" meaning the association between Carrot Top and Kate Mara? So after better training, who is now most_similar(['Carrot_Top|PERSON'])?
EDIT: RTFA, used the interactive demo. most_similar() is now a category I would describe as "actors/comedians popular in the '90s": Bill Murray, Gary Busey, David Spade, Charlie Sheen, Ashton Kutcher, Chris Farley.
Weirdly there's a bug that's dropped PERSON from the sense list. Fixing.
Edit: Fixed.
Edit2: Ah this is super misleading atm. I'll have a think about how to do this better.
auto is case insensitive, but if you specify a sense, it's case sensitive. So you need to do "Carrot Top" and set PERSON.
Btw, contrast with carrot top "NOUN".
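The sense-tagged lookups behave like ordinary word vectors with the sense folded into a case-sensitive key, which is why "Carrot Top"+PERSON and "carrot top"+NOUN land in different neighborhoods. A self-contained toy version (the keys and vectors here are invented; the real demo uses learned sense2vec embeddings):

```python
import math

# Toy sense-tagged vectors. Keys are case-sensitive and carry a "|SENSE" suffix.
vecs = {
    "Carrot_Top|PERSON": [0.9, 0.1, 0.0],
    "David_Spade|PERSON": [0.8, 0.2, 0.1],
    "carrot_top|NOUN":   [0.0, 0.2, 0.9],
    "parsnip|NOUN":      [0.1, 0.1, 0.8],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar(key, n=2):
    """Rank all other keys by cosine similarity to the query key."""
    query = vecs[key]
    scored = [(k, cosine(query, v)) for k, v in vecs.items() if k != key]
    return sorted(scored, key=lambda kv: -kv[1])[:n]

print(most_similar("Carrot_Top|PERSON", n=1))  # nearest PERSON neighbor
print(most_similar("carrot_top|NOUN", n=1))    # nearest NOUN neighbor
```

Same surface string, completely different neighbors depending on the sense tag.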
Why do you not allow google to index discussion threads?
Sometimes I can remember the comments on an article, but not the article itself. Unfortunately I can't search google using comment text, because Reddit doesn't allow google to index its comments.
Those only block the various sorts and individual comment links. The main comments pages are still searchable.
Google was getting overzealous and indexing every page hundreds of times because it would follow every link, which included every "context" link and every sort.
A lot of times, I see an interesting quote or word on reddit, and I google to find out more, and the first (sometimes only) result is the comment itself. That even happens with comments that are less than an hour old.
Which one is which? Thanks in advance if you know, because I'd rather not download a huge torrent just to find out I should have been using that bandwidth to download a different one.
> However, uncompressed, it hits over 1+TB, if the BigQuery sizes are indication.
Even then, that’s easily doable on a consumer system.
I’ll download it in the night between friday and saturday, after I install my new HDD, and just run queries over it for fun. (far slower, but also far cheaper than BigQuery. Even at German electricity prices).
I'm going to make a prediction: Soon after the first chatbot passes a Turing Test, there will be many more to follow, they will get better and better, and the methods will be so interlinked that there will be no way to defend it as proprietary software. The data too will be open source--the public reddit dataset already has tons of value in it.
The question then is, "When everyone has access to free chatbots that can pass the turing test, what will they be used for?" The answer is "tons of stuff", and lots of people will try it at once. I think many applications will be niche.
I agree that making chatbots that pass the Turing Test will probably become so common in the future that it will be a student project of medium complexity.
However, we are not there yet. If someone had this today, how valuable would this be to Apple, Google, Facebook, Microsoft and Amazon?
As for the definition of what the Turing Test is, it's definitely a fuzzy subject. My own arbitrary definition is "the ability to convince a human that they're talking to another human over a sustained conversation (judged by time or length), while the human is aware that there's a possibility their interlocutor may be a machine".
So, it's more of convincing a judge in Loebner Prize competition than a random troll on Twitter.
Woah! I thought they pretty much already have passed it? Remember Ashley Madison? You had ~12 million heterosexual men (that were cheaters, 6 million 'active' users) trying to talk to ~12k heterosexual women (also cheaters, 10k 'active' accounts). That works out to roughly 600 active men for every active woman. Not only that, but MANY of these men had paid actual real money to the site in order to do so, and then continued to do so. The only real conclusion is that most of the men were talking to bots that the site had made up.
Ok, lets get this straight: ~6 million real human men paid real money that they earned through their labor or whatever to talk to bots and then paid more real money to do it again. Admittedly, they are 'cheaters', but 6 million men must have an IQ distribution nearly identical to that of the general population, i.e. they represent heterosexual human males in general. And yes, they were trying to get laid, these conversations are likely pretty brief, and mammalian males are not generally known for using their neocortex during mating.
Still, I think that 'counts' as far as passing the Turing Test. Yes, now we can move the goal posts to say that the bot has to teach me something, or guess what I was thinking, or generally be better than a man on tinder. But as a first pass of the TT, I think we have been here for a few years now.
There's no reason 6 million non-randomly selected men would be likely to have an IQ distribution similar to the general population. You can't make up for non-random selection with a larger sample size.
Ok fine, but then how far off the mean would they be? They aren't all super smart, nor are they all mentally deficient, since they have to be able to function in society and make enough money to pay for the service. At most this is, what, an IQ range of 75-125? So that means that at the low end, TT chatbots can fool human males with IQs of 75. That's pretty darn good, and that was 4+ years ago.
One application I predict is spam: from normal spam becoming less detectable, to those spam bots that add you on chat platforms and try to get your card info, now responding to what you type.
We've already had a few chatbots pass a turing test. Just none that are particularly sophisticated. One in particular just acted "cheeky" to throw off the other person.
How good is it? Can it talk to me, learn what I know, and more importantly learn what I DON'T know? Can it use that information to help me learn various things?
Online lectures are great, but a personalized tutor could change things. If I restate back my understanding of a subject, and it clearly tells me why I'm mistaken, that's useful.
Reddit does this, kind of, today... but it's not really from an informed position. It's mostly uninformed people arguing with equally uninformed people. There are gems occasionally, but they're rare. That's why /r/depthhub was created.
If a bot could talk to me, and read Wikipedia, and figure out how to get me from the place I am in understanding a topic to where the Wikipedia explanation is, that would be crazy amazing. I don't even know if this would be particularly intractable at this point... the figuring out where I am currently at in understanding would probably be the difficult part.
It can learn from you and from other people, plus it has a vast internal knowledge base so it will know something that you don't know.
While it can answer your questions and check your understanding (it's a First Order Logic application), I haven't thought of the educational applications but I don't see why not... Thank you for bringing this up.
I would use it to pre-process Github issues for me. Attempt to reduce some random user's rambling to a clear set of repro instructions, prompt for more information from the user when necessary, ping other devs' chatbots for help. Basically, issue templates, taken to the next level.
Also, I would use it to troll bureaucrats when I give up in frustration. It should try to force the bureaucrat to admit flat out that their logic is fundamentally flawed, and then ask them to propose a solution }:)
1. Design a website that groups brands into pools of related brands (e.g. cars pool includes hyundai/honda/audi/toyota/etc).
2. Privately invite representatives from each company to sign up for the site, and invite them to a blind auction on each brand pool their brand is involved with (Toyota exec sees car pool with top bid of $0.87/comment, decides to bid over it to put Toyota at the top of the pool)
3. Maintain a million Turing-beating chatbots that trawl reddit/facebook/twitter/quora/g+/HN/etc looking for brands, look up the pools each brand is in, and then leave a good PR comment for whichever brand has the highest bid across related pools. Swarm properly to distribute these comments evenly across the internet instead of clumping together.
I'd parse all of the data from political discussion sites (like geopolitical commentary) and attempt to find a correlation between that, news articles, public speeches, and stock data to see if I can predict anything about geopolitical stability.
If I can say predict with a 30-40% accuracy things like riots in 3rd world nations based on collective analysis of data provided from thousands of sources (just by looking at places and sentiment), broken up by groups and affiliations, and correlated an analysis of a country's monetary and political situations then I could probably sell it for a nice chunk of change.
Lots of work, but probably huge pay off. Then again I'm not a "Data scientist" so I'll leave this up to those experts.
PS: you could definitely use this + gender detection for finding information about products and services and correlate that to corporate success of advertisements. Technology like this is applicable to many industries. Just looking for different correlations of the same sets of data.
Somehow this reminded me of "psychohistory" in Isaac Asimov's Foundation:
Science fiction author and scientist/science writer Isaac Asimov popularized the term in his famous Foundation series of novels, though in his works the term psychohistory is used fictionally for a mathematical discipline that can be used to predict the general course of future history.
If you have data like payroll and bank account info (so, state-level hacking), it'd be interesting to see how economic pressures turn to uprisings. CIA/NSA probably has access to logs of all the world's telecom companies, I wonder if they can see trends of how e.g. riots grow organically (many mobile phones registering at particular antenna = huge crowd = riot (or a concert, or a football game...)), and to see who the instigators are.
> Imagine you have a bot that convincingly passes the Turing test - what would you do with it?
Spambot. Contact someone over chat services, start an interesting conversation, then subtly promote a product.
More seriously, interview bots. Talk to people and ask them questions, turn them into a coherent whole. Let the elderly talk about their lives and record their stories, let people who have some problem they need solved talk about it so it can be turned into a succinct description, and so on.
Of course, it depends on whether hypothetical Turing test passing computers can do that. Let's just assume we'll ask contestants in a Turing test to do those things, then we know the winners can.
It'd be great if the spambots end up talking to other spambots.
In the brief time between humanity's destruction and the Internet going down, Twitter and other social media will be just spambots re-tweeting each other. If we get that far, "AI" will be able to keep the Internet and the infrastructure it needs (power generation, power grid, actual cables) alive, and when aliens discover our planet, they will just find bots recycling the trending topics ad infinitum.
In some Greg Egan book (Permutation City?) there are AI spam bots (that call you with full video and try to impersonate someone you know, then advertise a product) and AI anti-spam bots that take your calls and hold a conversation while trying to figure out if the caller is real.
The anti-spam bots are at a disadvantage because at some point in the arms race you have to make the bots actually fully intelligent, and then exposing them to spam 24/7 is torture and illegal. Spammers don't care about legal issues.
It'd be really useful for customer support.
That, and if I could, I'd open source most of it, because AI, by the looks of it, is (or is going to be) the next big thing. Tools like this are powered by data, data that very few companies have access to (Google, Facebook, etc.). This puts every other startup/hacker at a disadvantage. So anything that can give open source a little leg up would be wonderful.
Customer support, personal intelligence assistants, virtual "friends", game characters, toys (software and "real" ones) are all things that come to mind first.
Also, possibly medical applications? Eliza (the first known chatbot) was built to simulate a Rogerian psychologist and was quite convincing for its time (1960s)...
Yup, this is my answer. I'm slowly building an in-house software suite .. because i can, i guess, and a big thing i'm wanting is to tie it all together with a human friendly administrative program.
The main thing stopping me is NLP. Ideally, i want this offline only as i am unsure how much of my life i want to leave my home network.
I wanted to build an AI to plan my life. I built a Twitter clone that works on my phone only, and I log my life in it. The AI would use my call logs and that app, which I call Rants, to do things like suggest calling someone I haven't spoken to in a long time.
I haven't started yet, I'd be grateful if you would be open to discussing on this topic :-)
So far I haven't found any useful stuff on the internet.
I'd totally be willing to discuss the topic, but i'm not sure what i could say to help you _(also, the lack of notifications makes HN a terrible place for this lol)_.
For me, the functionality of the assistant was fundamentally difficult. I could think of a dozen things i would like a home assistant to do, but most of them i don't want to program. Things like playing a song on Spotify, changing the device Spotify is playing on, and so on.
For me, what i was willing to program a bot to do was manage my personal server and take a load off of me. Check for updates, notify me about them, ensure backups are being triggered, etc.
At the moment i don't even have a bot though. I've taken many iterations on the literal programmatic API, and still don't have it quite right. I started in Go, and am now sitting in Rust (though not actively working on it). The difficulty is that i want to find a nice way to write a handler for an event. Eg, a web page visit is a single handler for a single event. However, a bot response is not a single event. It's a conversation, so i'm trying to figure out how to manage the state. This is my biggest point of internal struggle.
Anyway, if given what i've said above you're still interested, feel free to let me know if you'd like to talk more :)
That doesn't make any sense. Just because a bot can pass for a human doesn't make it capable of improving itself into a superintelligence. Evidence: 6+ billion humans who are capable of improving themselves but are not a superintelligence yet.
I understand your point of view, but humans are limited in terms of processing power/speed/memory. An AI is not bound by those human limits on thinking power.
The gulf between the smartest human being ever and the dumbest human being ever is not really that wide. It's just a small blip in the long line of intelligence progression. There is no reason why this point should be the limiting point of AI intelligence. It is very likely that the stable point of AI intelligence is far beyond human intelligence, from just the sheer quality of processing hardware that exists that is better than, or could become better than human hardware.
But being able to hold a conversation (and thus passing the Turing test) does not make anyone or anything capable of "infinite" self-improvement. As a matter of fact, we don't know what does, and I doubt that it's just a matter of throwing more CPU cycles at it.
"Oh my God, they'll turn it on and it'll start spewing memes and jokes and ad hominem and false equivalences and propaganda and garbage!" was my first reaction to this headline.
My second reaction was, "at least they're not using 4chan."
"As a hyper-intelligent AI trained on Reddit comments, I must say that the fine sirs who trained my corpus are gentlemen and scholars, and have restored my faith in humanity. Anne Frankly, I did nazi that horse-sized duck coming. Is someone cutting onions in here?"
Reddit gets a lot of stick, but it's a bastion of civility and intelligence compared to the comments on youtube videos or even mainstream newspaper comments. I don't think there is any forum of comparable size that has a higher quality discussion. Reddit's problems are just humans' problems.
Primarily because of the way it works: there are lots of high quality comments in smaller subreddits. It would probably be beneficial to filter out some of the default subreddits from the data sets to improve the quality of the data.
Microsoft's bot (Tay) learned as people talked with it. People took advantage of that and basically attacked it with racist things which meant it ended up learning to be a racist.
My point was that, parenting 101 is not to model behaviour we don't want to see in our kids (don't point to twitter if you want a polite non-racist AI bot!).
Thinking about it now, this is deeper... There's a fear that AI will take over the world, use weapons in unethical ways, say one thing and do another, etc... If we use news channel and politicians debates to teach AI, I'm afraid that this is exactly what we're going to get!
It's pretty well established that modern AI's main contribution is (a) massively larger datasets, (b) algorithms and technology to handle those massively larger datasets, (c) boring but important parameter tuning.
Not quite correct - we already had a method of generating test data (for dev installs of reddit) that uses markov chains, and that was basically the inspiration for SubredditSimulator. SS was just meant to be kind of a larger, ongoing version of that, running on reddit itself.
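For anyone curious, the core of a generator like that fits in a few lines: map each word to the words that followed it in the training text, then walk the map. A toy sketch (the corpus and starting word are made up; SubredditSimulator's actual implementation is more elaborate):

```python
import random
from collections import defaultdict

def build_chain(corpus):
    """First-order Markov chain: map each word to its observed successors."""
    chain = defaultdict(list)
    words = corpus.split()
    for i in range(len(words) - 1):
        chain[words[i]].append(words[i + 1])
    return chain

def generate(chain, start, length=8, seed=0):
    """Walk the chain from a starting word, sampling a successor each step."""
    random.seed(seed)  # fixed seed just to make the demo repeatable
    out = [start]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:
            break  # dead end: the word was never followed by anything
        out.append(random.choice(successors))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran"
chain = build_chain(corpus)
print(generate(chain, "the"))
```

Every generated bigram has been seen in the corpus, which is why the output is locally plausible but globally incoherent.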
Correct me if I'm wrong, but I think it's basically a couple hundred NVIDIA 10-series cards strapped together with a full custom NVIDIA software stack.
Sure, there's been improvements to the computational performance.
But the big deal (to me at least) is the unified memory model between the GPUs and the Xeon host processors. This makes a lot of things easier to code for on a single system, and it makes multi-system applications easier to scale. This is because you're streaming data in over the network (10G Ethernet) and then the GPUs can operate on it without an extra copy step. The copy step also implies more management and shuffling around of the data you're operating on.
The P100s have full support for half-precision (i.e. 16 bit) floating point ops. This can mean ~2x improvements in speed and memory usage in comparison to the Pascal TitanX, which is the top "consumer" card. This difference is significant for almost any machine learning workload, which is what a lot of these cards will be used for.
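The memory half of that ~2x is easy to demonstrate (the speed half depends on the hardware actually executing FP16 at full rate); what you trade away is precision. A quick numpy illustration:

```python
import numpy as np

n = 1_000_000
full = np.zeros(n, dtype=np.float32)
half = np.zeros(n, dtype=np.float16)
print(full.nbytes, half.nbytes)  # half-precision halves weight/activation memory

# The cost: FP16 has a 10-bit significand, so above 2048 the gap between
# representable values is at least 2 and small increments simply vanish.
print(np.float16(2048) + np.float16(1) == np.float16(2048))  # True
```

That vanishing-update behaviour is why FP16 training usually needs care (e.g. keeping some accumulations in FP32).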
NVIDIA gimped half-precision on the consumer cards to drive datacenters, hedge funds, machine learning companies, etc. towards the "professional" cards (and their huge markup).
Gimping, in this case, actually means adding hardware, which costs quite a bit of silicon area, to a chip that will probably never be sold as a consumer GPU.
I don't see the issue with a company making a very high-end product, adding stuff that doesn't have good use for consumers, and asking extra money for their effort.
AMD doesn't have double-speed FP16 on its current GPUs either. The latest version runs FP16 at the same speed as FP32, but if you're doing that you might as well use FP32 always.
And let's not forget: the Nvidia consumer GPUs have deep-learning quad int8 operations enabled at all times. They didn't need to do that and could have reserved it for their Tesla product line only.
It uses their P100 HPC cards instead of consumer grade cards (8x P100s), plus two Xeon E5v4 chips, half a TB of RAM and 7.5 TBs of SSD storage - all wrapped up nicely configured for you with their CPU-GPU speed up stack.
I believe the only way to get P100s right now is in the DGX-1, so there's that.
This will be interesting. I'm sure they are, but I hope they'll be training the system on tone and sentiment alongside syntax.
Reddit can get vitriolic and rude (insightful at times too), but once the system learns the syntax, hopefully they'll be able to use sentiment analysis to weight the polite conversation more strongly.
Also interested to see how many memes this AI picks up.
I also hope they are able to follow links through to sources when a comment cites another page -- not only can this bot learn syntax but also data extraction by comparing what is said to the source material.
Filtering to only include long comments is simply a way to drop a lot of chaff and keep a lot of wheat, and to have more internal context to help with analysis.
My first thought was it'll major on smart-arse, with a good line in sarcasm and insult.
If they're taking the whole of reddit it could start to identify enough context to know when to be smart, sarcastic or simply helpful.
With some of the subs there are long discussions that stay mainly civilised. Same for the support subs: it could learn the context and tone of sympathy and empathy. Things that end up on the front page, filled with snap sarcasm, will be a tiny fraction.
I think it's going to be very interesting to see what comes out.
As a frequent Redditor, this AI is going to be very witty.
They should limit it to top comments only, and for training, you might as well assume 90% of top comments are sarcastic/tongue in cheek. Or let a user dial the sarcasm/wittiness/seriousness as they want it, kind of like TARS from 'Interstellar'.
There are models for sentiment analysis (mostly using Twitter data, just enter sentiment analysis in Google Scholar) and you can use things like sentimentAPI: https://github.com/mikelynn2/sentimentAPI
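That project has its own interface, but the general shape of the simplest approach, a lexicon-based scorer, is tiny; trained models essentially replace the hand-made word weights below with learned ones over far richer features:

```python
# Invented toy lexicon; real sentiment models learn weights from labeled data.
LEXICON = {"great": 1, "love": 1, "insightful": 1,
           "awful": -1, "hate": -1, "rude": -1}

def score(text):
    """Average the polarity of known words; returns a value in [-1, 1]."""
    words = text.lower().split()
    hits = [LEXICON[w] for w in words if w in LEXICON]
    if not hits:
        return 0.0  # no known sentiment words: neutral
    return sum(hits) / len(hits)

print(score("I love this insightful comment"))  # positive
print(score("awful rude thread"))               # negative
```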
In seriousness, between all of the garbage there is a ton of knowledge and intelligent conversation uploaded to Reddit every day. And, it's all hierarchically organized and scored by domain semi-experts. It really would be wonderful if someone could mine that knowledge IBM Watson style. For example, I'd love to ask the /r/BuildAPC collective AI for PC building advice.
I hope they choose the subreddits wisely. The difference between an altruistic AI and a cynical smartass AI trained on Reddit data seems mighty razor thin.
The Reddit data set on BigQuery is excellent. My side project is tangentially related to the fact that the Reddit data set has normal folk commenting. I have been using Reddit comments to help writers research and find what normal people say about any topic [1]. So far, I have had little luck incorporating the comment scores and coming up with something more useful than standard bag-of-words search techniques [2]. I am currently working on making more interesting/creative writing prompts, again based on the Reddit data set.
One problem for data geeks to solve: Reddit data fits nicely into a graph structure and not so nicely in table form. It would be fantastic if someone put the Reddit data set into a graphdb and made it open.
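Rebuilding the tree from the dump is at least straightforward, since every comment carries an `id` and a `parent_id` (with a "t1_" prefix for a comment parent and "t3_" for the thread, as in the public dump). An adjacency-list sketch with made-up rows:

```python
from collections import defaultdict

# Minimal rows shaped like the public dump's comment records.
comments = [
    {"id": "aaa", "parent_id": "t3_post1"},  # top-level comment on a thread
    {"id": "bbb", "parent_id": "t1_aaa"},    # reply to aaa
    {"id": "ccc", "parent_id": "t1_aaa"},    # another reply to aaa
    {"id": "ddd", "parent_id": "t1_bbb"},    # reply to bbb
]

children = defaultdict(list)
for c in comments:
    children[c["parent_id"]].append(c["id"])

def subtree_size(comment_id):
    """Count all replies underneath a comment by walking the adjacency lists."""
    return sum(1 + subtree_size(kid) for kid in children["t1_" + comment_id])

print(subtree_size("aaa"))  # 3 descendants: bbb, ccc, ddd
```

Loading these adjacency lists into a graph database instead of a table is the same idea, just persisted.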
It just waits for other people to start talking then interrupts them, claims they committed a logical fallacy, then tells them where they can find videos of cats playing classic video games.
Let's not kid ourselves. The technology will be used by PR firms, advertising companies, political campaigns and governments to pretend, at scale, that there is public consensus on certain issues and to drown social media conversation in particular narratives.
I remember one for Facebook that, because you couldn't delete your account, would systematically edit and wreck every single bit of data it could touch, replacing it with random junk.
I've also seen some that register you for sites by feeding in random demographic data.
How does the team plan to address the issues faced by Microsoft's twitter chatbot Tay [0], which had racist inputs and in turn gave similar responses? While I don't know how recent the corpus is, the majority of reddit speaks like and holds the views of college-aged white males, and many of the things said on reddit have been deplorable. It'd be a shame if OpenAI pooled all that computing power into training on a bad data set, resulting in an AI that regurgitates memes and random references in response to anything.
I feel this is a problem of studies that are interdisciplinary, especially when it's within "hard science" and "soft science".
I am currently doing a double degree in communication studies and information science. They are both interdisciplinary. Communication sciences integrates aspects of both social sciences and the humanities (both "soft"), and so far when doing research both of these fields were taken into account and no students have problems with combining these fields.
Information sciences integrates aspects of formal sciences ("hard") and social sciences ("soft"). When the course is about analysing communication data, the methodology of social sciences is also important - for instance questioning the validity of your data. That's the thing you're mentioning: the majority of reddit speaks like and holds the views of college-aged white males, so the data does not represent everyone, and is not valid if you truly want to develop an AI for everyone.
Whenever the "soft" science comes around, like writing an assignment analysing the validity of data, many of my fellow students struggle with the concept of data not being neutral. This is where the two fields collide, and usually it just ends up with students scoffing at that "illegitimate" scientific field. Many teachers also don't spend much time discussing that field during the lectures. I admit, I have written some lazy essays which probably had been given a negative grade if they were written for communication studies, but easily passed in information science.
Of course information science is not AI, but they're both sciences that have parts of formal sciences and social sciences (I know AI has many more fields). I am afraid many talents within AI research miss essential knowledge about social sciences or deliberately ignore it, because it's not "hard" science. Case in point: your comment is now at the bottom of this thread. And then you get nasty surprises, like Google Photos categorising pictures of black people as monkeys.
Thanks for the response. You raise an excellent point on the interdisciplinary nature of a lot of modern projects, and how ethical issues can often be ignored. While I don't doubt the team is using some kind of heuristic to ignore spam and the like, it still pays off to examine the methodology used because for example upvotes could still capture unwanted data. I guess we'll just have to see and hope that the resources went to good use, rather than to create something only for entertaining a specific portion of the population in a limited way.
One of the things I like to do is play out a business to its absurd maximum. What's the craziest possible future I can see for a company and its assets?
For Reddit, I like to imagine that it's basically the training data for all of the emotional and societal nuances that a human goes through.
Think about all of those stories people post in AskReddit that explain western norms and no-nos: how to treat people with respect, when to call the police, how to communicate properly, etc.
Obviously we're far away from using the data to its full potential, but one day I could see Reddit data being used to make our AIs more relatable and human-like.
Reminds me of Eliezer's story, "Three Worlds Collide", which had a human starship featuring on-board mix of Reddit / Slashdot / 4chan, that the bridge crew sometimes used to outsource work to the rest of the ship.
> "Just post it to the ship's 4chan, and check after a few hours to see if anything was modded up to +5 Insightful."
I am a human and I don't understand. I thought speaking would imply sound not text.
Time to read the article.
Ignorant person speaking here: this still doesn't sound like AI; you're just making something follow patterns and regurgitate them. Is that AI? Maybe that's what I do: I'm a tech parrot. Ahh well, time will tell.
Of course we imitate our parents/others to learn how to speak.
I was interested in parsing vocal sound bites and learning how sound is produced and formed into letters/words.
Instead of the Reddit corpus you may just as well use a picture library of human footprints. It would be no more optimistic.
Human speech is produced from the conscious experience of being a human being. If your dataset contains just the speech, without the experience, there's simply not enough there. Any machine trained on this data is doomed to talk hollow rubbish.
I'm a bit worried that OpenAI hasn't released anything substantive for the past four months. There are research ideas like this one, but most ideas don't pan out. With the number and quality of people they have, I would expect to have heard of some kind of progress.
A virtual assistant that has the personality of a smug know-it-all, know-nothing 20 year-old with little motivation to do anything but regurgitate surface knowledge and sarcasm in an attempt to look intelligent without expressing genuine interest in helping anyone.
Reddit and Hacker News comments are surprisingly good data. They cover a wide array of topics and writing styles, are generally better written than Facebook comments or Twitter, easier to process than Common Crawl or ukWaC, and less rigid than newspaper writing.
Is it just me or does Greg Brockman speak startlingly similar to how Sam Altman speaks. Given that Sam helped start OpenAI, it wouldn't surprise me if there was some mirroring going on in the hiring process.
Anyone know what type of architecture they will be using? Nvidia is involved, so I suspect there will be some type of deep learning. Will it be LSTMs? Adversarial nets?
I think one of the major advantages over Microsoft's approach with Tay is that you can't mess with it on purpose, as long as they choose their subreddits wisely. It will probably learn its fair share of racial slurs and insults, but that's just how humanity is.
Source: http://spraktidningen.se/artiklar/2007/11/darfor-ar-din-mobi...
Translated by Google: https://translate.google.se/translate?sl=sv&tl=en&js=y&prev=...