Wait... the naive Bayes classifier was trained on Yelp data? Isn't Yelp data also crowd-sourced? I may not be thinking about this right, but it seems to me that if you train the classifier on crowd-sourced data and then compare it to Mechanical Turk, in the end you're just comparing one set of crowd-sourced data to another?
In fact you're comparing individually crowdsourced data to massively mechanically aggregated crowdsourced data. When viewed like this the results are not in the least surprising.
Making good Turk tasks is a science in and of itself. Figuring out the incentive is the key, and sometimes you have to think a bit out of the box.
We actually used turking at my company for some really nutty stuff: logo generation. Basically we'd give people a URL and ask them to generate a 160x40 logo for it. We had some base rules, like the background had to be solid, no scaling artifacts, etc.
We assigned each logo to five people.
Our reward was essentially this:
- anybody who met all the rules got 25c
- the best of all that met the rules got a 50c bonus
It took a few days for people to get the hang of it, but after that we consistently got excellent results, with some really creative stuff coming back. Yes, we were paying up to $1.50 for the logos, but we weren't using them for every site, only the really popular ones, and having it automated made it worth it. Every day we spent maybe 60 seconds picking the best logo of five submissions for a few dozen sites, everything else was automated.
The product that used these by the way is NewsRoom, a pretty sexy RSS reader available on Android. All the logos you see for sites there were generated by Turkers.
Anyways, finding the right equation for that task took some experimentation, but I was impressed by the results in the end.
79 passed. This was an extremely basic multiple choice test. It makes one wonder how the other 4,581 were smart enough to operate a web browser in the first place.
I stopped reading right there.
As for the question itself, that's simple: people come for the money, and since "Turkers" are paid pennies for those tasks, they have to do a lot of them; so replying randomly on a test is a no-brainer (I wouldn't even bother to click and type; I'd just write a script).
It's a good thing we've got these magazines reminding us how we are so smart and the rest of the world is so stupid. What would I do without my over-inflated ego?
On my surveys on MT I ask things like "Why did you choose this answer?". I also made a little script library to record the times they enter answers in fields. I throw out any that appear to be scripted or were so rapidly answered as to not have been a real answer.
At the price per HIT I paid (0.08 to 0.15), I only had to throw out 2-3 answers out of ~2000 due to someone trying to reply randomly.
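A rough sketch of that kind of timing filter, for anyone who wants to try it. This is not the script library mentioned above; the thresholds, the function name and the sample data are all made up for illustration, and the input is assumed to be the list of times (in seconds) at which each field was answered.

```python
from statistics import pstdev

# Hypothetical thresholds; tune them against your own data.
MIN_SECONDS_PER_FIELD = 2.0  # faster than this is unlikely to be a real answer
MIN_GAP_STDEV = 0.05         # near-identical gaps between fields suggest a script


def looks_suspicious(timestamps):
    """timestamps: seconds at which each field was answered, in order."""
    if len(timestamps) < 2:
        return False  # nothing to compare
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_gap = sum(gaps) / len(gaps)
    too_fast = mean_gap < MIN_SECONDS_PER_FIELD
    # Only judge regularity when there are enough gaps to say anything.
    too_regular = len(gaps) >= 3 and pstdev(gaps) < MIN_GAP_STDEV
    return too_fast or too_regular


# Invented sample data, one entry per worker.
submissions = {
    "worker_a": [0.0, 6.2, 15.9, 21.4, 40.0],  # uneven, human-like pacing
    "worker_b": [0.0, 0.3, 0.7, 1.0, 1.2],     # far too fast
    "worker_c": [0.0, 5.0, 10.0, 15.0, 20.0],  # perfectly regular
}
kept = [w for w, ts in submissions.items() if not looks_suspicious(ts)]
print(kept)
```

Combining this with a free-text "why did you choose this answer?" field gives two independent signals, so a slow but careless worker can still be caught by the text check.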
Yes, but why keep reading an article that insults people ... the reason for the low accuracy was not the point of my comment.
My own father is "not smart enough" to operate a browser. Lack of English skills doesn't help him. But he can read French and Russian just fine, he has a Ph.D. in his profession and a career in politics (former advisor to the prime minister, currently a senator in an Eastern European country).
If you read the paper, you are correct. It states that after surveying the people who took the test, the authors concluded that many people just answered quickly, hoping to get access to tasks as fast as possible.
Also, the paper says 1658 passed, but probably only 79 passed with > 90% accuracy?
They don't mention the price per HIT. If they're paying between $0.01 and $0.05 for these HITs, I'm not surprised by these results.
I looked at the cited paper and did not see the cost, but without the cost I really would not bother interpreting these results. "Machines work for electricity; humans need real money. News at 11."
Now there's an idea. It would be a beautiful irony if, a few years from now, the Mechanical Turk API was used as an open platform for AI applications to make money solving difficult problems.
I've always had a bizarre idea that it would be awesome to apply for some low-level data entry job at a low-tech company with no programming staff. Then automate the task and get loads of work done while not actually being at the office, and repeat this with several other jobs until the sum of the data entry jobs' pay is greater than that of an individual programmer.
But then I realized that, in most offices, work done is meaningless next to number of hours spent in the building ;)
Actually if they are true bayesians they probably dream of the surrounding landscape, the quality of grass the sheep graze on, seasonal weather conditions, available water quality and probably most important, an entire subset of classifiers related to the competence of the shepherds.
Did anyone else read the paper? The summary doesn't seem very correct to me.
From the summary:
The results weren't pretty: in order to find a population of Turkers whose work was passable, the researchers first used Mechanical Turk to administer a test to 4,660 applicants. It was a multiple choice test to determine whether or not a Turker could identify the correct category for a business (Restaurant, Shopping, etc.) and verify, via its official website or by phone, its correct phone number and address.
79 passed. This was an extremely basic multiple choice test. It makes one wonder how the other 4,581 were smart enough to operate a web browser in the first place.
From the paper:
Of the 4,660 workers who took this test, only 1,658 (35.6%) workers earned a passing score, and over 25% of workers answered fewer than half of the questions correctly.
To investigate the high failure rate, we conversed with workers directly on TurkerNation and through private email. Based upon worker’s names and email addresses, we believe that we conversed with a representative sample of workers both inside and outside the United States. We found that the test was not too difficult and that most workers comprehended the questions. We believe that many applicants simply try to gain access to tasks as quickly as possible and do not actually put care into completing the test.
i.e., 1658/4660 workers passed this test, NOT 79 (!!)
Then later they describe some additional filtering they put in place to attempt to find the best workers (they tried estimated location and time to complete task). Based on these filters they said: Using a combination of pre-screening and the test tasks described above, only 79 workers of 4,660 applicants qualified to process real business changes.
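For reference, the two numbers work out as follows. The counts are the ones quoted from the paper above; the variable names are just for illustration.

```python
applicants = 4660
passed_test = 1658   # passed the multiple choice test
qualified = 79       # survived pre-screening plus the test tasks

test_pass_rate = passed_test / applicants
qualified_rate = qualified / applicants
qualified_of_passed = qualified / passed_test

print(f"passed the test:        {test_pass_rate:.1%}")       # 35.6%
print(f"qualified overall:      {qualified_rate:.1%}")       # 1.7%
print(f"qualified among passed: {qualified_of_passed:.1%}")  # 4.8%
```

So the summary's "79 passed" conflates the test pass rate (about 36%) with the much stricter final qualification rate (under 2%).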
I was at NIPS and talked to one of the authors. I thought the paper was interesting, but I think the "you're not paying enough" critique is spot on. Humans clearly can be better at this task---you just can't give them strong incentives to cut corners on quality, which happens with a low piece-rate and a task that takes on the order of 3 ~ 4 minutes to do properly.
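To put the piece-rate point in numbers, here is a back-of-the-envelope sketch. The 3 to 4 minutes per task comes from the comment above; the price per task is NOT given in the paper, so the $0.01 to $0.05 range used here is only the guess floated elsewhere in this thread.

```python
def hourly_rate(pay_per_task, minutes_per_task):
    """Effective dollars per hour for a worker who does the task properly."""
    return pay_per_task * 60.0 / minutes_per_task


# Assumed prices (a guess, not from the paper) and the stated task duration.
for pay in (0.01, 0.05):
    for minutes in (3, 4):
        rate = hourly_rate(pay, minutes)
        print(f"${pay:.2f} per task at {minutes} min/task -> ${rate:.2f}/hour")
```

Under those assumptions a careful worker earns about a dollar an hour at best, which is exactly the kind of incentive that pushes people to cut corners.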
Am I right in thinking that a naive Bayes classifier is beyond "not even the best out there," and is in fact about as simple a learning algorithm as you can get, and straight out of AI 101?
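To give a sense of how little there is to it, here is a minimal sketch of a word-count naive Bayes classifier with add-one smoothing. It is not the classifier from the paper; the training examples are invented, and only the category names (Restaurant, Shopping) echo the article.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (label, list_of_words) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(model, words):
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        # log P(label) + sum of log P(word | label), treating words as independent
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data, purely for illustration.
toy = [
    ("restaurant", ["pizza", "menu", "dinner"]),
    ("restaurant", ["menu", "brunch", "cafe"]),
    ("shopping", ["shoes", "sale", "store"]),
    ("shopping", ["store", "clothing", "sale"]),
]
model = train(toy)
print(classify(model, ["brunch", "menu"]))
print(classify(model, ["sale", "shoes"]))
```

Training is just counting, and classification is a sum of logs per class, which is why this really is first-course material.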
They're sometimes a good technique only because some problems are really simple. There are almost no problems where the extreme independence assumptions of naive Bayes create a reasonable likelihood function. The consequence ends up that when it's wrong, it tends to be very very certain that it's right. I think the aphorism that gets passed around is "Naive Bayes classifiers are often in error but never uncertain".
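A toy illustration of the overconfidence: feed naive Bayes the same piece of evidence several times (i.e. perfectly correlated features) and it counts it as fresh evidence each time. The probabilities and the function name below are made up for the example.

```python
def posterior_a(n_copies, p_given_a=0.8, p_given_b=0.4, prior_a=0.5):
    """Posterior of class A after seeing one feature duplicated n_copies times,
    with naive Bayes treating every copy as independent evidence."""
    a = prior_a * p_given_a ** n_copies
    b = (1 - prior_a) * p_given_b ** n_copies
    return a / (a + b)


for n in (1, 2, 5, 10):
    print(n, round(posterior_a(n), 4))
```

With a single copy the posterior is about 0.67, which is all the evidence actually supports. With ten copies of the same feature it is above 0.999, even though no new information was added.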
Yup. But some problems -- for instance, discriminating between spam and non-spam emails, and keeping up decent discrimination as spammers vary their tactics -- are (1) "really simple" in that sense and (2) apparently quite difficult to solve, given that there basically were no really effective spam filters before naive-Bayes ones came along.
We use a modified naive Bayes extensively in a commercial application -- from what I understand it's extremely quick to classify, easy to modify/customize, and deals very well with gaps in data. For a lot of applications, things like SVM and WAODE are only minor incremental improvements.
Partly this is because naive Bayes's unreasonable independence assumptions (which are almost always badly violated) turn out not to actually hurt classification performance in a lot of cases, even in theory, because under a lot of distributions the independence violations basically cancel out: http://www.aaai.org/Papers/FLAIRS/2004/Flairs04-097.pdf
Does this indicate that the majority of Turkers are already just simple scripts? Perhaps just not as well adapted to particular problem sets as this custom-built one was.