A cartel of influential datasets are dominating machine learning research (unite.ai)
260 points by Hard_Space on Dec 6, 2021 | 71 comments



Too bad they don't cite the paper "The Benchmark Lottery" (M. Dehghani et al., 2021) (https://arxiv.org/abs/2107.07002)

> The world of empirical machine learning (ML) strongly relies on benchmarks in order to determine the relative effectiveness of different algorithms and methods. This paper proposes the notion of "a benchmark lottery" that describes the overall fragility of the ML benchmarking process. The benchmark lottery postulates that many factors, other than fundamental algorithmic superiority, may lead to a method being perceived as superior. On multiple benchmark setups that are prevalent in the ML community, we show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks, highlighting the fragility of the current paradigms and potential fallacious interpretation derived from benchmarking ML methods. Given that every benchmark makes a statement about what it perceives to be important, we argue that this might lead to biased progress in the community. We discuss the implications of the observed phenomena and provide recommendations on mitigating them using multiple machine learning domains and communities as use cases, including natural language processing, computer vision, information retrieval, recommender systems, and reinforcement learning.

Edit: By "they", I was actually referring to the linked article. Strangely, even the paper the article is about does not cite "The Benchmark Lottery" at all.


It's also a shame that this paper in turn doesn't cite "Testing Heuristics: We Have It All Wrong" (J. N. Hooker, 1995) (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71....), which discusses these issues in much the same way. It's good to see in "The Benchmark Lottery", however, that they look more into specific tasks and their algorithmic rankings and provide some sound recommendations.

One thing I'd add (somewhat selfishly, as it relates to my PhD work) is the idea of generating datasets that are deliberately challenging for different algorithms. Scale this across a test suite of algorithms, and their relative strengths and weaknesses become clearer. The caveat is that it requires having a set of measures that quantify different types of problem difficulty, which, depending on the task/domain, can range from well-defined to near-impossible.
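Roughly the flavour of what I mean, as a toy sketch (the models, sizes, and selection rule here are placeholders, not my actual setup): score each candidate instance by how much a suite of reference algorithms disagree on it, then keep the most discriminating instances as the challenge set.

    # Toy sketch: keep the instances a suite of algorithms disagree on most,
    # so the resulting set discriminates between methods. Hypothetical setup.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_fit, X_pool, y_fit, y_pool = train_test_split(X, y, test_size=0.5, random_state=0)

    suite = [LogisticRegression(max_iter=1000),
             DecisionTreeClassifier(max_depth=5),
             KNeighborsClassifier()]

    # Per-instance error profile: did each algorithm get each pool example wrong?
    errors = np.stack([clf.fit(X_fit, y_fit).predict(X_pool) != y_pool for clf in suite])

    # High variance across the suite = some methods succeed where others fail,
    # which is exactly what exposes relative strengths and weaknesses.
    disagreement = errors.var(axis=0)
    challenge_idx = np.argsort(disagreement)[-200:]
    X_challenge, y_challenge = X_pool[challenge_idx], y_pool[challenge_idx]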


I've been looking at parsing paragraph structure and have started thinking about a conceptual mechanical turk/e e cummings line in the sand where it's just going to be easier to pay some kid with a cell phone to read words for you. The working implementations I've seen are heavily tied to domain and need to nail down language, which isn't really a thing.

Quantification is fascinating, it seems to be something I take for granted until I actually want to make decisions. It's like I'm constantly trying to forget that analog and digital are two totally separate concepts. I wouldn't really recommend reading Castaneda to anyone but he describes people living comfortably with mutually exclusive ideas in their head walled off by context, and I'd like that sort of understanding.


Seems like a distinct issue to me, though there are obvious parallels and similar outcomes.


Because AIs are relatively narrowly focused, they suffer greatly from limited data sets. More often than not, an AI will simply memorize test data rather than "learning".

This impacts the benchmarking process, so I think it's relevant.


Hey, you're giving away my secret for passing AWS certification exams!


Agree, I was not saying it was redundant. But I would have expected a reference, as that paper was the first related item that came to my mind.


Ugh the title of the article.

Here's the paper that this article is recapping - "Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research", https://openreview.net/forum?id=zNQBIBKJRkd

Abstract: "Benchmark datasets play a central role in the organization of machine learning research. They coordinate researchers around shared research problems and serve as a measure of progress towards shared goals. Despite the foundational role of benchmarking practices in this field, relatively little attention has been paid to the dynamics of benchmark dataset use and reuse, within or across machine learning subcommunities. In this paper, we dig into these dynamics. We study how dataset usage patterns differ across machine learning subcommunities and across time from 2015-2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions. Our results have implications for scientific evaluation, AI ethics, and equity/access within the field."

The reviews seem quite positive, if short. On a skim it looks very solid, offering an empirical look at the dynamics of benchmark usage that IMO seems unprecedented, so I'm not surprised it got positive reviews and was accepted.


FWIW, this paper is being presented at NeurIPS this week.


The title is linkbait: There is no cartel.

The word cartel implies collusion. There is no collusion.

Using the word "cartel" in this case is not only wrong, but also... insulting to all the hardworking individuals who have toiled away in obscurity to compile, clean-up, label, package, and publish influential datasets.

Despite the terrible title, the OP is worth a read. It summarizes a recent research paper, also worth a read, about the disproportionate popularity in ML research of a relatively small number of datasets sponsored by and produced at elite institutions (e.g., ImageNet for visual recognition):

https://openreview.net/forum?id=zNQBIBKJRkd

Surely we can find ways to address the winner-take-all dynamics of benchmark popularity without unfairly accusing anyone of running a cartel.


> Surely we can find ways to address the winner-take-all dynamics of benchmark popularity

Reviewer 2 is probably the only/best way to address this. More so if they're a grant reviewer, of course, but paper reviewers will do OK as well.


A little confused:

Are these big "dominant" institutions charging for this data? No, they spend a lot of resources putting them together and give them away for free.

Are they preventing others from giving away data? No, but it costs a lot and they bear that cost.

Are they forcing smaller institutions to use their data? No. It's just a free resource they offer.

Do they get the grants themselves because they have some kind of proprietary access? No, the whole point is that the benchmark is open and everyone has access.

So they collect this data, vet it, propose their use for benchmarks and give it away free. What is the complaint? The problem is not even posed as "well, this data is overfitted in papers or we are solving narrower problems". [Edit: they sort of do, but leaving this here so the comment below makes sense as a follow up.]

No, the complaint is that this handful of institutions is giving away free data and too many people use it? Can we have more "problems" of this nature?


> So they collect this data, vet it, propose their use for benchmarks and give it away free. What is the complaint?

It's well known that neural networks can easily inherit biases from their training data. It's also well known that datasets generated by western universities are widely used in training and evaluating neural networks.

If my training set is full of pictures of Stanford CS undergraduates, I could end up with a computational photography system that makes everyone look like Stanford CS undergraduates, or a historical photo colourisation system that makes everyone look like Stanford CS undergraduates, or a self driving car pedestrian tracking system that expects 90% of pedestrians to look like Stanford CS undergraduates.

And if the people who make the model say "Hey, our model's biases aren't our responsibility, we're just representing the training data as best we can" and the people who make the training data say "Hey, we never claimed it was perfect, you can take it or leave it" these problems might fall through the cracks.


> And if the people who make the model say "Hey, our model's biases aren't our responsibility, we're just representing the training data as best we can"

I don't think this excuse is like the others. If the model doesn't work well because they used biased data, it is the job of the people making the model to find better data (or manipulate the training process to overweight some data and attempt to counteract the bias).

I think the burden of responsibility has to be on the people who make models or put models into products to make sure the model is a good fit for the problem it is solving.


> It's well known that neural networks can easily inherit biases from their training data. It's also well known that datasets generated by western universities are widely used in training and evaluating neural networks. If my training set is full of pictures of Stanford CS undergraduates, I could end up with a computational photography system that makes everyone look like Stanford CS undergraduates, or a historical photo colourisation system that makes everyone look like Stanford CS undergraduates, or a self driving car pedestrian tracking system that expects 90% of pedestrians to look like Stanford CS undergraduates.

Why then aren't foreign universities/companies simply... building their own datasets?


>The problem is not even posed as "well, this data is overfitted in papers or we are solving narrower problems".

I mean it does get into that:

"They additionally note that blind adherence to this small number of ‘gold’ datasets encourages researchers to achieve results that are overfitted (i.e. that are dataset-specific and not likely to perform anywhere near as well on real-world data, on new academic or original datasets, or even necessarily on different datasets in the ‘gold standard’)."


True, I guess my complaint is that they don't get into the mechanisms.

The sibling comment on the benchmark lottery paper lays this out. But I should modify my comment.


I believe the author is suggesting that large institutions are doing this to earn extra citations - the currency of academia.

A large institution can make a dataset for X then browbeat other researchers into using X and citing X. Using X also likely leads to citations of derivative work by the lead institution.


Is the term "browbeat" fair here? I don't think they're making calls and saying "Oh, nice paper, but I notice you used this other dataset..." No, they're putting out good-quality datasets that people want to use. If that earns them a citation, good for them. It's the least I can do for helping me test my algo.


It depends; feedback is rarely as polite as "I noticed you used this other dataset". The feedback would probably look like:

- "Nice paper, however the results are not relevant to current research due to the use of X dataset rather than Y or Z datasets score 2/5 do not accept."

- "Nice paper, however the results are of unknown quality due to the use of X dataset 3/5 recommend poster track".

In fact, I'd generally say that most paper reviews would drop the first three words of that feedback. It's not an unreasonable assertion that progress is measured on standard datasets - but it's also necessary to push back on this.


Absolutely. Failure to report results on a popular benchmark suggests to some reviewers that you have something to hide - even though those experiments might be computationally expensive or tangential to the main point of the work.


If a non-standard dataset is being used, I would expect there to be a discussion/analysis on what characteristics of that dataset made it unusable for this paper. Especially if a proposed model is being compared against models that were trained on those standard datasets.

If you are establishing new baselines using those same models on your non-standard dataset, then one would expect you to put in a good amount of effort to fine-tune all the knobs to get a reasonable result. If the authors are able to put in that much effort, then that kind of feedback is definitely unreasonable.


>> If a non-standard dataset is being used, I would expect there to be a discussion/analysis on what characteristics of that dataset made it unusable for this paper.

Unfortunately that just adds more work for the reviewer, which is a motive for many reviewers to scrap the paper so they don't have to do the extra work.

That sounds mean, so I will quote (yet again) Geoff Hinton on things that "make the brain hurt":

GH: One big challenge the community faces is that if you want to get a paper published in machine learning now it's got to have a table in it, with all these different data sets across the top, and all these different methods along the side, and your method has to look like the best one. If it doesn’t look like that, it’s hard to get published. I don't think that's encouraging people to think about radically new ideas.

Now if you send in a paper that has a radically new idea, there's no chance in hell it will get accepted, because it's going to get some junior reviewer who doesn't understand it. Or it’s going to get a senior reviewer who's trying to review too many papers and doesn't understand it first time round and assumes it must be nonsense. Anything that makes the brain hurt is not going to get accepted. And I think that's really bad.

https://www.wired.com/story/googles-ai-guru-computers-think-...

Basically a new dataset is like a new idea: it makes the brain hurt, for the overburdened experienced researcher or inexperienced younger researcher alike. Testing a new approach on a new dataset? That makes brain go boom.

Which is a funny state of affairs. Not so long ago, one sure-fire way to make a significant contribution that would give your paper a leg up over the competition was to create a new dataset. I was advised as much at the start of my PhD (four-ish years ago). Seems like this has already changed.


You might be right here. My comment was more of my expectation as a reader on what should be present in such a paper.


It takes substantial effort to build a good dataset, proportionally more if it gets bigger, and people like big datasets because you can train more powerful models from them. So I am not surprised that people tend to gravitate towards datasets made by well-funded institutions.

The alternative is either a small dataset that people heavily overfit (e.g. the MUC-6 corpus, heavily used for coreference at some point, where people cared more about getting high numbers than useful results) or something like the Universal Dependencies corpus, which is provided by a large consortium of smaller institutions.


When someone uses the word "overfitting" I usually take it to mean that a model has begun to enter the phase where further improvement is leading to lower generalization.

In fact, as far as I can tell, we are not overfitting in this sense. When I have seen papers examine whether progress on, let's say, ImageNet actually generalizes to other categorization datasets, the answer is yes.

What we have been seeing is that the slope of this graph is flattening out a bit. Whereas in the past a 1% improvement on ImageNet would have meant a 1% improvement on a similarly collected dataset, nowadays it will be more like 0.5% (not exact numbers, just numbers to illustrate what I mean by diminishing returns).

If an institution or a lab can show that progress on their dataset -better- predicts the progress on a bunch of other closely related tasks, then as researchers become convinced of that, they will switch over. Right now there isn’t a great alternative because it’s not easy to create such a dataset. Scale is critical.

ImageNet really was on the right track as far as collecting images of nearly every semantic concept in the English language. So whatever replaces it will have to be similarly thorough and broad.

In my opinion the biggest weakness of currently existing datasets is that they are typically labeled once per image with no review step. So I think the answer here isn’t

“Let’s get researchers to use smaller datasets from smaller institutions”

It would be more like

“We have to figure out a way to get a bigger, cleaner version of existing datasets and then prove that progress on those datasets is more meaningful”

The realistic way this plays out is that some institution in the “cartel” releases a better dataset and then lots of small labs try it out and show that progress on that dataset better predicts progress in general.


I disagree: some "collective overfitting" happens when everyone evaluates on the same dataset (and often, the same test set).

There's a neat set of papers by Recht et al. showing results are slightly overfit to the test partitions of ImageNet and CIFAR-10: rotating examples between the train and test partitions causes systems to perform up to 10-15% worse.

https://arxiv.org/abs/1806.00451 https://arxiv.org/abs/1902.10811

There's another neat bit of work involving MNIST. The original dataset (from the mid-'90s) had 60,000 test examples, but the distributed versions that virtually everyone uses have only 10,000 test examples. Performance on these held-out examples is, unsurprisingly, a bit worse:

https://arxiv.org/pdf/1905.10498.pdf


I think it's super important to separate the following two situations, both of which I suppose are fair to call overfitting.

Situation A: Models are slightly overfit to some portions of the test set. But the following holds:

IF PerformanceOnBenchmark(Model A) > PerformanceOnBenchmark(Model B) THEN PerformanceOnSimilarDataset(Model A) > PerformanceOnSimilarDataset(Model B)

Therefore progress on the benchmark is predictive of progress in general.

Situation B: The relation does not hold, and therefore progress on the benchmark does not predict general progress. This almost always happens if you train a deep neural network long enough: train performance goes up, but test performance goes down.

If you look at figure 2 of the first paper you sent, you will note that it shows we are in situation A and not situation B.

Situation A overfitting = diminishing returns on improvements on benchmark, but the benchmark is still useful. Situation B overfitting = the benchmark is now useless
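To make that check concrete: roughly, you'd look at whether the benchmark ranking of models correlates with their ranking on an independently collected, similar dataset. A rough sketch (the model names and scores are made up purely for illustration):

    # Rough sketch of the Situation A check: does the benchmark ranking predict
    # the ranking on a similar, independently collected dataset?
    from scipy.stats import spearmanr

    benchmark_acc   = {"model_a": 0.81, "model_b": 0.79, "model_c": 0.76, "model_d": 0.72}
    similar_set_acc = {"model_a": 0.74, "model_b": 0.73, "model_c": 0.69, "model_d": 0.66}

    models = list(benchmark_acc)
    rho, p = spearmanr([benchmark_acc[m] for m in models],
                       [similar_set_acc[m] for m in models])

    # rho near 1: benchmark progress still predicts progress elsewhere
    # (Situation A, possibly with a flattened slope).
    # rho near 0 or negative: Situation B, the benchmark has stopped ranking methods usefully.
    print(f"Spearman rank correlation: {rho:.2f} (p={p:.3f})")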


>> "This almost always happen if you train a deep neural network long enough: train performance goes up, but test performance goes down."

This is a problem that is more common for classification, I think. Generative and self-supervised models (trained with augmentation) tend to just get better forever (up to some asymptote) because memorization isn't a viable strategy.

I personally think image classification is mostly a silly problem to judge new algos on as a result, and leads to all kinds of nonsense as people try to extrapolate meaning from new results.


Nearly all useful machine learning is supervised, still. And if you are using a neural network, it will eventually memorize. This is fine though, we have early stopping :)


Eh, I work in audio ML, where most of the interesting and useful work is in conditioned generative models: compression, TTS, source separation. The bias towards classification is a side effect of people staring at ImageNet too long; I really think it's holding the field back.


Lately there have been amazing results on pairs of web text+image, no need for labelling. These datasets are hundreds of times larger and cover many more categories of objects. GPT-3 is also trained on raw text. I think ImageNet and its generation have become toy datasets by now.


This is what my dissertation was about, in a different field: Everyone using the same datasets, even though they are severely flawed.

As it turned out, differences between datasets proved significantly larger (by a big margin) than differences between algorithms. And the most popular datasets in fact included biases and eccentricities that were bound to cause such problems.

https://bastibe.github.io/Dissertation-Website/ (figures 12.9-12.11, if you're interested)


Thank you! I also found the same thing during my PhD, and it took a basic pay-to-publish paper and 2 years to get my supervisor to agree that this was not really going anywhere publication-wise and to switch to a field where I am more comfortable producing publications. Essentially my field has 1 big dataset, and then domain experts "improve" models by creating their own data (200 samples) and then claim a "novel" method (which is not transferable at all).

The pay-to-publish paper was sent back by 2 journals with reviews amounting to "our method is better, just use the right (our) thing" (all of which I refuted for the professor by doing exactly that, but the editor wouldn't hear any more of it...). And then there are papers in these journals predicting features through a complex preprocessing pipeline. Academia and big companies are just idiotic.


Funny that you would mention trying to publish such findings in journals. I tried... more than two times, as well. Rejected with the most spurious of claims, despite my presenting evidence disproving them. Frankly, these journal submissions were the most frustrating and trying experiences in my professional life.

Oh well. I'm glad I got my degree, and could leave academia relatively unharmed.

Now I work in AI (engineering!), where most science is high-school-level "hey, I tried stuff, and things happened. Dunno why, though". It's just ridiculous.


I think it's good that well-funded institutions publish well-crafted datasets for everyone else. It helps small teams develop their models.

The problem, though, is that most publishers don't accept papers if the new proposal isn't backed by benchmarks on the well-known datasets. Even if it is a competitive approach for a specific field, they reject the paper anyway if it doesn't perform well on those datasets.


Do they really reject them though? In my experience publishers will publish anything, if anything they should have higher standards not lower.


For brand name journals and conferences, yes they do. My experience with those is a very high rejection rate.

There is a good reason to reject if you are not using a standard dataset. How can you compare the results of two approaches to, say, natural language inference (whether one sentence entails another) without the results being tested on the same dataset?

I think one thing overlooked in the conversation is that many papers start with a standard baseline and then use another dataset to establish additional results.

In my experience in NLP, journals and conferences also tend to establish datasets of their own when they make a call for submissions. Often these are called the "shared task" track. ACL has operated this way for decades.


It's a pattern-matching problem. It takes a lot of work, and there are too many unknowns, when dealing with "new benchmark sets" or... just any other set.


This article is completely disconnected from what is happening in the machine learning community.

Granted, datasets have grown larger and larger over time. This concentration is actually a healthy sign of a community maturing and converging towards common benchmarks.

These datasets are open to all for research and have fueled considerable progress both in academia and industry.

The authors would be well-advised to look at what is happening in the community. Excerpts from the program of NeurIPS happening literally this week:

Panels:

- The Consequences of Massive Scaling in Machine Learning
- The Role of Benchmarks in the Scientific Progress of Machine Learning
- How Copyright Shapes Your Datasets and What To Do About It
- How Should a Machine Learning Researcher Think About AI Ethics?

All run by top-notch people coming from both academia and industry, and from a variety of places in the world.

I am not saying that everything is perfect, but this article paints a much darker picture than needed.


Somewhat related to this: What methods do people normally use to measure the quality of a dataset?

For example, if there are 50 datasets of historical weather data how can I determine which one is garbage?


I would say it's garbage if there's a paper like this: https://arxiv.org/abs/1902.01007


That's like hard negative mining. You can also train a weak model and filter out the examples it manages to predict correctly to come up with a harder dataset.
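Something along these lines, as a rough sketch with toy data (the weak model and split sizes are arbitrary choices, just to show the filtering step):

    # Sketch of the filtering idea: fit a deliberately weak model, then keep only
    # the examples it gets wrong, leaving a "harder" subset. Toy, hypothetical setup.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
    X_fit, X_pool, y_fit, y_pool = train_test_split(X, y, test_size=0.8, random_state=0)

    weak = LogisticRegression(max_iter=200).fit(X_fit, y_fit)  # intentionally simple model
    correct = weak.predict(X_pool) == y_pool

    # What survives can't be solved by the shallow cues the weak model picked up.
    X_hard, y_hard = X_pool[~correct], y_pool[~correct]
    print(f"kept {len(y_hard)} of {len(y_pool)} examples")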


That looks interesting.

Any more papers on the subject that you can recommend?


At the moment, no, but this could be helpful: https://www.connectedpapers.com/main/42ed4a9994e6121a9f325f5...


So, like humans, training an AI takes lots of curated data (education, journalism, political savvy and avoidance of bias or corruption). It took a very long time to get that corpus ready and available for humanity (say 10,000 years).

Now we have exploded that corpus to include everything anyone says online, and we are worried that the lack of curation means we cannot be sure what the models will come back with.

It's rather like sending our kids out to find an education themselves, and finding three come back with nothing, two spent years learning from the cesspit of extremism, two are drug addicts hitting "more", and one stumbled into a library.

Just a thought, but I don't think journalism is really about writing newspaper articles. It is really about curating the whole wide world and coming back with "this is what you need to know". Journalism is the curation AI needs...


Kids (and GPT) say the darndest things...


I come from the econometrics end of things and Paul Romer (Nobel Prize winner) put it well:

'A fact beats theory every time...And I think economists have focused on theory because it's easier than collecting data.'

When I read this article, this is exactly what I thought of -- modelers/researchers always focus on the low-hanging fruit, which is sitting in their comfy chair in their ~200-year-old university developing hyper-complicated models rather than going out and collecting data that would answer their questions.


>> According to the paper, Computer Vision research is notably more affected by the syndrome it outlines than other sectors, with the authors noting that Natural Language Processing (NLP) research is far less affected. The authors suggest that this could be because NLP communities are ‘more coherent’ and larger in size, and because NLP datasets are more accessible and easier to curate, as well as being smaller and less resource-intensive in terms of data-gathering.

I'll have to read the paper to see what exactly it says on this but my knowledge of NLP benchmark datasets is exactly the opposite: the majority are simply no use for measuring the capabilities they're supposed to be measuring. For example, natural language _understanding_ datasets are typically created as multiple-choice questionnaires, so that, a) there is already a baseline accuracy that a system can "hit" just by chance (but which is almost never noted, or compared against, in papers) and b) a system that's good at classification can beat the benchmarks black and blue without doing any "understanding". And this is, indeed, what's been going on for quite a while now with large language models that take all the trophies and are still dumb as bricks.

To make matters worse, NLP also doesn't have any good metrics of performance. Stuff like BLEU scores is just laughably inadequate. NLP is all bad metrics over bad benchmarks. And NLP results are much harder to just "eyeball" than machine vision results (and the models are much harder to interpret than machine vision models, where you can at least visualise the activations and see... something). I think NLP is much, much worse than machine vision.


Your suspicion is correct. I worked on such a dataset paper and worked to "beat" other methods on well-accepted benchmarks with dubious accuracy scores. One fundamental issue is that outside of POS tagging, there isn't a notion of empirical truth to measure against, only a small sample of what a "normal person would think". This stands in contrast to computer vision, where in a task such as monocular depth estimation from a single frame, you can always measure against a Lidar-scanned depth map. The system can still overfit on the benchmark, and they do, but at least the baseline truth itself is not in dispute. But a question such as "is this the appropriate response to a query?" is too open to interpretation.


Well, I recently released an NLP dataset of almost 200K documents and I got 4 whole citations in a year. Wish I could find a way to join this cartel and get someone to use it or even care.

https://paperswithcode.com/dataset/debatesum

https://huggingface.co/datasets/Hellisotherpeople/DebateSum

https://scholar.google.com/citations?user=uHozRV4AAAAJ&hl=en


This paper the article refers to is fantastic! I think it's a work most in ML research should become familiar with. And if you believe in the power of benchmarks and data, then this holds even more true. Investing in diversity in datasets is likely an impactful way to make progress in AI/ML.

Minor typo in this article...

ARTICLE: Among their findings – based on core data from the Facebook-led community project Papers With Code (PWC) – the authors contend that ‘widely-used datasets are introduced by only a handful of elite institutions’, and that this ‘consolidation’ has increased to 80% in recent years.

...but right after, they quote the paper, and clearly it is 50%, not 80%. See the quote from the paper:

PAPER: ‘[We] find that there is increasing inequality in dataset usage globally, and that more than 50% of all dataset usages in our sample of 43,140 corresponded to datasets introduced by twelve elite, primarily Western, institutions.’

...and the article is leaving out this relevant quote from the paper:

PAPER: Moreover, this concentration on elite institutions as measured through Gini has increased to over 0.80 in recent years (Figure 3 right red). This trend is also observed in Gini concentration on datasets in PWC more generally (Figure 3 right black).

...and in general the article is right that inequality is increasing over time, but Gini is a specific metric for measuring inequality, and a Gini coefficient of 0.80 is not the same thing as 80% consolidation.
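For anyone unfamiliar, the Gini coefficient here is computed over the distribution of dataset usages across institutions. A quick sketch with made-up numbers shows why 0.80 is a concentration index rather than a percentage share:

    # Gini coefficient: 0 = usage spread evenly across institutions, ->1 = usage
    # concentrated in a few. The counts below are invented for illustration.
    import numpy as np

    def gini(counts):
        x = np.sort(np.asarray(counts, dtype=float))  # ascending
        n = x.size
        lorenz = np.cumsum(x) / x.sum()
        # Standard formula: (n + 1 - 2 * sum of the Lorenz curve) / n
        return (n + 1 - 2 * lorenz.sum()) / n

    usages_per_institution = [500, 300, 100, 40, 20, 10, 5, 5, 3, 2]
    print(round(gini(usages_per_institution), 2))  # -> 0.73: heavily concentrated usage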


I'm extremely surprised to see CUHK on the list. I live 10 minutes from them and never knew they were a big player in the ML scene. A brief search online didn't turn up anything interesting apart from the CelebA dataset.

Edit: After asking a friend, it seems most of their research is with official Chinese research institutes and SenseTime. It makes sense now.


Also interesting in this context: this recent paper [1] analyzed some of the most popular datasets and found numerous label errors in the test sets. For example, they estimated at least 6% of the samples in the ImageNet validation set were mislabeled, with surprising consequences when comparing model performances on corrected sets.

[1] Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks, https://arxiv.org/pdf/2103.14749.pdf


Just another example of academic conformity.


If there were a dataset that solved a problem sufficiently (perception, self-driving, whatever), there would be no more need for any study of that field (other than for marginal improvements).

Once solved, these are no longer research areas. Getting X score on a Y validation set means you get your software driver's license.


The word "cartel" is too negative, especially when discussing political and social factors. The narrative seems written, before their findings. Or at least, it could well have.

For a computer science analogy: it is like a paper finding that most successful programming languages are created at prestigious institutes. An obvious - not a bad - finding. It's not as if you could give the motivation, skills, expertise, resources, and time to a small new institute and expect it to come up with a new language that the community will adopt.

Yes, if you write and publish a good data set, and it gets adopted by the community, then you gain lots of citations. This reward is known, and therefore some researchers expend the effort of gathering and curating all this data.

It is not a "vehicle for inequality in science". Benchmarks in ML are a way to create an equal playing field for all, and allows one to compare results. Picking a non-standard new benchmark to evaluate your algorithm is bad practice. And benchmarks are the true meritocracy. Beat the benchmark, and you too can publish. No matter the PR or extra resources from big labs. It is test evaluation that counts, and this makes it fair. Other fields may have authorities writing papers without even an evaluation. That's not a good position for a field to be in.

> The prima facie scientific validity granted by SOTA benchmarking is generically confounded with the social credibility researchers obtain by showing they can compete on a widely recognized dataset

Here, the authors pretend the social credibility of researchers has any sway. There is no social credibility for a Master's student in Bangladesh, but when they show they can compete, they can join and publish. Wonderful!

Where the authors use the long history of train-test splits to argue that the cons have outweighed the benefits, they should reason more and provide more data to actually show this and bring the field along. Ironically, people take more note of this very paper because of the institutional affiliation of the authors. I do too. If they had a benchmark, I would have looked at that first.

> Given the observed high concentration of research on a small number of benchmark datasets, we believe diversifying forms of evaluation is especially important to avoid overfitting to existing datasets and misrepresenting progress in the field.

I believe these authors find diversity important. But on overfitting, they should look at actual (meta-)studies and data, which seem to conflict with their claim. For instance:

> A Meta-Analysis of Overfitting in Machine Learning (2019)

> We conduct the first large meta-analysis of overfitting due to test set reuse in the machine learning community. Our analysis is based on over one hundred machine learning competitions hosted on the Kaggle platform over the course of several years. In each competition, numerous practitioners repeatedly evaluated their progress against a holdout set that forms the basis of a public ranking available throughout the competition. Performance on a separate test set used only once determined the final ranking. By systematically comparing the public ranking with the final ranking, we assess how much participants adapted to the holdout set over the course of a competition. Our study shows, somewhat surprisingly, little evidence of substantial overfitting. These findings speak to the robustness of the holdout method across different data domains, loss functions, model classes, and human analysts.


> Among their findings – based on core data from the Facebook-led community project Papers With Code (PWC)

Oh boy. PWC is not even close to a representative sample of what datasets are being used in papers. It's also often out of date.


When you cannot analyze every paper ever published, you need some relevant criterion for inclusion -- and PWC is somewhat of a GitHub of AI science (at least it was in its early days). There may be multiple collections, but PWC is by far the most accessible and thus reasonably captures current and emerging trends in some dataset-heavy research fields.


But because the effort of adding a new dataset is greater than that of adding a new paper, this sample will systematically overestimate the number of papers that use the most popular datasets.


With increasing evidence that self-supervision works for multiple tasks and architectures, and sim2real/simulated data being used in industry I do not see this as an important concern.


I think they are highly overestimating how much science advanced pattern matching will provide. Certainly no conceptual understanding will ever come from that.


Advanced pattern matching is called science, and it is how humans have made progress: taking data and writing down models for it. Now computers do it. End-to-end algorithms are 'black box', but machine learning in general is much broader than that and is providing understanding in many fields. Sorry you only see/know the "draw a box around a pedestrian" stuff, but try not to judge entire fields based on limited exposure.


I think you are highly underestimating how much science and math comes mainly from advanced pattern matching.

Most stuff is proven by using advanced pattern matching as "intuition", and then breaking problems down into things we can pattern match and prove.

I'm not sure what conceptual understanding you think isn't pattern matching. They are just rules about patterns and interactions of patterns. These rules were developed through pattern matching.


Cartel, really?


Right? I thought so too, but this is the first definition from the American heritage dictionary:

> A combination of independent business organizations formed to regulate production, pricing, and marketing of goods by the members.

And it does seem to apply ¯\_(ツ)_/¯


They're not regulating anything. They just make the best datasets and those are the ones that get used


Legal regulation, no, but in the sense of having groups of internal people approving the release of data, data cleanup processes, data input, etc etc etc... yes, it's 100% regulated. At least in the cybernetics sense.


That's true of all organizations that release data. They're regulating their own data. They're not regulating the use of that data


Since we're talking about this in a definitional context, you can just as easily argue that a drug cartel doesn't regulate the use of their drugs...


A cartel regulates a market. Drug cartels regulate the buying and selling of drugs. This "cartel" doesn't regulate any market



