Absolutely, it's hugely important. The core idea that subjective beings will disagree holds obvious weight, but it takes serious commitment to improving the process to offer up this kind of experiment to prove it.
Since NIPS is a very prestigious conference, I'd expect a lot of submitted papers (perhaps a vast majority of them) would fall in a grey area between "clearly unsuitable" and "clearly suitable". I personally think there are too many factors at play in evaluating a series of papers - no objective sorting can really exist.
A note for people outside the academic Machine Learning field: NIPS is widely believed to be on a different level from the rest - during my PhD, my advisor used to say that a NIPS paper would be on par, resume-wise, with a paper published in a good journal. The difference is especially striking if you've had the chance to attend other conferences (including, alas, IEEE-sponsored events), which are, with very few exceptions, fairly terrible from a scientific point of view.
Selecting talks is really subjective. It's more like choosing what songs to play at a party than an objective sorting process.
You have limited speaking slots. You have to guess what the conference attendees will find interesting this year. You are biased by your own particular interests.
Also, some of the submitters are your friends or colleagues. Even if they haven't already told you what they're submitting (unlikely, since your relationship is built on talking about this stuff), you can tell a paper is theirs within 250 words...
However much a conference tries to sell the fairness and objectivity of its process, you can't anonymize or double-blind these things away.
For two independent committees, 6% of papers were acceptable without disagreement and 25% were rejectable without disagreement, while the rest were coin flips. This means that when your paper gets accepted or rejected, luck is playing a huge part. It's not that the judges are literally flipping coins; rather, the vast majority of papers don't seem strikingly good or bad, so a repeated trial may not produce the same outcome. The asymmetry here is also striking: definitely-bad papers outnumber definitely-good papers by about 4X.
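Here's a toy simulation of that split, just to make the luck concrete (the 6% / 25% / 69% proportions and the coin-flip behaviour are assumptions lifted from the numbers above, not the actual NIPS data):

    import random

    # Toy model: 6% clear accepts, 25% clear rejects, and the remaining 69%
    # are "coin flips" that each committee effectively decides at random.
    def committee_decision(kind):
        if kind == "accept":
            return True
        if kind == "reject":
            return False
        return random.random() < 0.5  # coin-flip paper

    random.seed(0)
    kinds = random.choices(["accept", "reject", "flip"],
                           weights=[0.06, 0.25, 0.69], k=10_000)

    a = [committee_decision(k) for k in kinds]  # committee A's decisions
    b = [committee_decision(k) for k in kinds]  # committee B's decisions

    accepted_by_a = sum(a)
    accepted_by_both = sum(1 for x, y in zip(a, b) if x and y)
    print(f"A accepts {accepted_by_a / len(kinds):.0%} of papers")
    print(f"B agrees with {accepted_by_both / accepted_by_a:.0%} of A's accepts")

Under these assumed proportions, only a bit over half of one committee's accepts would also be accepted by the other, which is exactly the "luck plays a huge part" point.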
These are really great observations with deep implications. The same pattern might apply in other aspects of life, such as interviewing candidates, selecting a mate, or buying a shirt. In all these cases, we might have a similar distribution at work.
I have often wondered why it is so hard to have less mediocrity in the world. Why isn't every book, t-shirt, or smartphone just great? One obvious reason is that a lot of the time people create something out of obligation, such as the demands of a job, instead of out of an urge to create. The follow-up question is: if no one had any obligation to create, could the distribution above flip on its head? For example, in that scenario would we have, say, 70% great papers, 5% mediocre, and the rest coin tosses?
"Why is not every book, t-shirt or smartphone is just great?"
Different people have different ideas of what "great" means. Not everyone thinks the Harry Potter books are great, while many do. We see the same thing in movies, where a film does poorly at the box office while the critics praise it.
The definition of greatness changes over time, so "It's a Wonderful Life", now considered one of the most critically acclaimed films ever made, had only mediocre revenue when it came out.
Greatness is sometimes situational, so "Dan Brown ... is the undisputed king of airplane books — the not-too-heavy, not-too-long potboilers perfect for a long layover." If you don't fly, then perhaps there's no time when Brown's works might appeal.
Travel has its own category of "good enough." Visiting Germany once, I bought a book from the limited English selection not because it was great, but because it was something to read on the long train ride.
A lot of people watch sports, but surely it can't be that all sports games are great, so greatness can't be the only reason for keeping someone's interest.
Since it's hard to predict greatness, people will test out ideas to see if there's a response. Sometimes this can lead to feedback and improvements. Sometimes this testing is through writing clubs. Sometimes (as with smartphone apps) this is with the market itself.
It's simpler than that. We focus on the differences between things, not the similarities. If all movies were equally good, we would then grow to focus on the tiny differences between them and start to judge them based on those.
This is the way "peer review" works. It is basically random. I have always found it comforting whenever I had a paper rejected as I would know it was nothing to do with the quality of my work. I would fix any of the typos found by the reviewers (you always get a spelling nazi as one of reviewers) and send it out again unchanged. I only have had one paper rejected twice and it was accepted unchanged on the third attempt.
My favourite peer review story is when I submitted one of my articles to the top journal in my field at the time (Applied and Environmental Microbiology). It came back with the usual trivial peer-review changes (cite this irrelevant paper of mine, etc.), which I made (this is nearly always easier than arguing with the reviewers). The editor made a mistake and, instead of sending the updated manuscript out to the original reviewers, sent it out to a new lot of reviewers. What was funny about the whole exercise was that the second set of reviewers called the first set idiots and told me to change everything back.
This is true, but there are always exceptions. The second paper I published I sent off to the journal, and after a couple of months I had not heard anything (this was in the physical-paper days when you had to mail everything). My supervisor decided to call the editor to ask what was happening. The editor said, "Oh, we published it last month." The whole paper had gone straight through without a single change. This, of course, was the last time I ever had a paper accepted like that :)
To be fair, calling it "Neural Information Processing Systems" isn't significantly more informative. The name is just a quirk of history; NIPS in its modern form includes research in all areas of machine learning, not just neural nets.
In fact for some years neural networks were very out of fashion there, and it was almost purely a statistical machine learning conference. I tend to just think of it as a machine-learning conference named "NIPS", which stands for something historical (like Perl and Lisp do).
I think people are reading more into this than there is. Reviewing papers is a highly subjective, high variance process, and very few papers get universally positive reviews.
From the point of view of an author, if you get a paper rejected that you know is worthwhile, you just have to make whatever improvements you can and then submit it again.
SIGMOD made an interesting move this year by accepting all papers that reached its standards. However, not every accepted paper will be given a presentation slot during the conference.
NIPS gets a ton of submissions, so the law of large numbers governs pretty strongly. Imagine that each paper submitted is independently either good or bad, with 22.5% probability of being good. With 1660 submissions, the total number of good papers follows a Binomial(1660, 0.225) distribution, which has mean 374 and standard deviation 17. Under this model, the fraction of good papers would be somewhere in the range 20.5-24.5% (corresponding to a two-standard-deviation window around the mean) in 95% of reviewing cycles. So even though the quality of the individual papers is totally random, the randomness mostly "cancels out" and the overall number of good submissions is relatively constant.
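If you want to double-check those numbers, here's a minimal sketch using the same assumed n = 1660 and p = 0.225:

    import math

    n, p = 1660, 0.225                # submissions, assumed per-paper "good" probability
    mean = n * p                      # expected number of good papers
    std = math.sqrt(n * p * (1 - p))  # binomial standard deviation

    low, high = mean - 2 * std, mean + 2 * std
    print(f"mean ~ {mean:.0f}, std ~ {std:.0f}")                         # ~374 and ~17
    print(f"~95% band: {low / n:.1%} to {high / n:.1%} of submissions")  # ~20.5% to 24.5%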
Of course this is assuming an objective standard for what constitutes a "good" paper. As others have pointed out, the only really meaningful standard is "how does this paper compare to other work being done in this field"? So it's also reasonable to think of NIPS's goal as just trying to present the best papers that were written in any given year, not as bestowing a strictly-defined stamp of objective quality.
When I've arranged conferences, we had a certain number of time slots. It's a bit flexible, in that we can decide to allocate a longer time for two talks, or shorter time for three, depending on the talks.
It could be that they had a first pass at a schedule, used that to set a first cut for the reviewers, then adjusted the schedule once they figured they needed to add another 42 papers.
Also, not being accepted does not mean that a paper is poor. They used a rank system, so it only means that others had papers which appeared to be better.
I am more curious about the inverse: what if more than 22.5% of papers were at acceptable quality levels? Wouldn't that leave each committee to pick and choose, thus artificially inflating their disagreements?
Yes, and I think that is essentially why we're seeing these disagreements.
I've heard from lots of professors that a good conference gets a lot of "very-good-but-not-great" submissions, and the job of the program committee is to pick the best among these. I wouldn't be surprised at all if minor personal preferences (which from the outside look rather random) ended up having a big say in the fate of a particular paper. Maybe some reviewers are more forgiving of poorly-written but technically strong papers, maybe some reviewers consider certain fields "dead" and so are biased against them, and reviewers tend to have wildly different standards for how extensive an experimental analysis should be to be acceptable, ...
Consider the demographic submitting to NIPS. It's a self-selected group within the top researchers in the world in that area. The best people in the field don't want to be seen publishing in so-called "second-tier" conferences, so they will submit exclusively to the likes of NIPS. And if you're an up-and-coming researcher or research group, you will want to establish credibility by publishing in these sorts of venues, and you will almost surely send your best work there. Add to this the fact that this is a "hot" field, with more and more researchers and research groups getting into it and trying to publish, and I think it's very likely that NIPS gets a lot more good papers than it can possibly accept.
What does "poor quality" mean? There is no absolute standard for quality. "Poor" is something like "less good than usual compared to the recent work in this community". So the top-scoring third-ish of papers sent to the currently-converged-on favourite venue of a community are pretty much by definition not poor. Unless something very weird indeed happens one year. There are usually only very few really excellent papers, though. Most papers are filler in retrospect.
Also, conferences need to accept a decent number of papers so that people will show up and cover the costs of the meeting. Venues are usually booked long before the program is fixed.
Ok, we've gone from "top tier venue, basically impossible to have a large fraction of poor papers submitted" to "Most papers are filler in retrospect" and "conferences need to accept a decent number of papers so that people will show up and cover the costs of the meeting". I guess if I am deciding whether or not to hire a professor I would be tempted to disregard publications in this conference.
Number of publications is a proxy for how much funding a professor can generate. Not much else.
> "Most papers are filler in retrospect" and "conferences need to accept a decent number of papers so that people will show up and cover the costs of the meeting"
None of these are conflicting. Conferences are often more about networking than the papers. Many papers are filler, but often only in retrospect. They are not obviously filler when presented.
> I would be tempted to disregard publications in this conference.
That was not something I suggested. NIPS is a very good conference, and a paper there is suggestive of quality work. Lots of past NIPS authors have been acqui-hired or regular-hired by Google and Facebook recently in their machine-learning spending sprees, for example.
I think it's a bit harsh to call the papers "filler", but the reality is that most papers (in CS, anyway) are incremental work on important but well-studied problems, or work on problems that are fairly narrow or not universally considered to be important. Reviewers tend to have wildly divergent opinions on how important or interesting that kind of work is.
The "in retrospect" was an important part of that point. Reviewers don't have access to it when reviewing.
Some conferences and journals have a retrospective prize for the best paper of, say, ten years ago. It's a neat way to recognize papers that turned out to be useful.
It's an XKCD-styled graph by the looks of it. There are a couple of generators out there that take data and make hand-drawn-looking graphs, e.g. http://xkcdgraphs.com/
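If you'd rather not rely on a web generator, matplotlib also ships an xkcd sketch mode; here's a minimal sketch of how you'd use it (the numbers plotted are just placeholders, and the look is nicer if the Humor Sans font is installed):

    import matplotlib.pyplot as plt

    # Wrapping plotting code in plt.xkcd() applies the hand-drawn look
    # to everything drawn inside the context manager.
    with plt.xkcd():
        fig, ax = plt.subplots()
        ax.bar(["clear accept", "coin flip", "clear reject"], [6, 69, 25])
        ax.set_ylabel("% of submissions")
        ax.set_title("Toy breakdown (placeholder numbers)")
        fig.savefig("nips_toy.png")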
I love the general idea of comic-style graphs, but this particular implementation of it does indeed look "fucking awful" in my opinion. The graph itself is fine, but the axes are made of this faux wavy line that eerily repeats itself.
> The graph itself is fine, but the axes are made of this faux wavy line that eerily repeats itself.
I see the same thing in fonts that are supposed to look "hand-drawn" and in CG renders of realistic scenes - it looks "imperfect", but the way the "imperfection" is itself perfect is what stands out. A little randomness goes a long way toward avoiding that.
Except that there really isn't imprecision in the estimates. The numbers are really quite precise and the confidence interval is actually very good - far better than what an XKCD style should be used for.
I use the XKCD style for WAGs, not well-supported data.
The site guidelines explicitly ask you not to post questions like this in the threads, but rather to email them to hn@ycombinator.com. The minutiae of title editing are not on topic here.
IEEE may be more prone to precision errors (letting bad papers in) while NIPS may be prone to recall errors (throwing good papers out).
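For anyone who hasn't run into those terms, here's a toy illustration of that framing (the counts are made up; "good" papers are the positive class and acceptance is the prediction):

    # Precision: of the accepted papers, how many were actually good?
    # Recall: of the good papers, how many were accepted?
    def precision_recall(tp, fp, fn):
        return tp / (tp + fp), tp / (tp + fn)

    # A lenient venue lets bad papers in, so precision suffers.
    print(precision_recall(tp=80, fp=60, fn=5))   # ~(0.57, 0.94)

    # A strict venue throws good papers out, so recall suffers.
    print(precision_recall(tp=40, fp=5, fn=45))   # ~(0.89, 0.47)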
With the way reviewing is done (no one can take a week off to read and fully comprehend the four papers they are given), you cannot achieve perfect separation - even if that were possible in principle.
Calling the process of "accepting a SciGen-generated paper into an allegedly peer-reviewed journal" a "precision error" is a bit on the optimistic side. It implies that someone was making a decision after reading the content of the paper, as opposed to, well, just accepting everything in sight.
It doesn't take a "week off" to notice that a paper is gibberish, at the very least.
That's a really brave thing to do, and it deserves serious credit.