Incredible how much benefit AlphaFold has brought. And all of that from a model with fewer than 100 million parameters.
I might be dumb, but could they scale it up and make an AlphaFold 3 with maybe 10 billion parameters? Would it be a lot better, assuming the same training effort is put into it?
If so, couldn't biotech companies just go nuts, build a 100-billion-parameter internal model, and have all the protein structures they want?
Has there been much research into the idea of distributing these large models across many heterogeneous machines?
I'm wondering if there could be a path towards a mix of alphafold and folding@home, with donated idle compute resources being used to train/run the models.
Designing for that sort of fragmentation could also make it easier to slowly run oversized models on local machines with swapped memory.
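To make the "swapped memory" idea concrete, here's roughly what that looks like today with Hugging Face Accelerate's disk offload (the model id and offload folder are placeholders, and this is a sketch rather than a recommendation):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "your/oversized-model"  # placeholder checkpoint id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",          # spread layers across GPU/CPU as space allows
        offload_folder="./offload", # spill whatever doesn't fit to disk ("swapped memory")
    )

    inputs = tokenizer("Example prompt", return_tensors="pt")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))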
There was great work last year on distributed training and inference:
Petals: Collaborative Inference and Fine-tuning of Large Models
Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, Colin Raffel
https://arxiv.org/abs/2209.01188
Would be wonderful to use this to host this model: LLaMA: Open and Efficient Foundation Language Models https://arxiv.org/abs/2302.13971
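For reference, the client side of Petals is only a few lines. This is a rough sketch based on the project's README around that time; the exact class name and checkpoint id are assumptions and may have changed since:

    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM  # class name as of the README, may differ now

    model_id = "bigscience/bloom-petals"  # example swarm-hosted checkpoint, may differ today
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoDistributedModelForCausalLM.from_pretrained(model_id)  # joins the public swarm

    inputs = tokenizer("A protein fold is", return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=16)  # each step runs across volunteer nodes
    print(tokenizer.decode(outputs[0]))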
With any current architecture it is infeasible to do backprop (training) due to the massive communication requirements. Inference can be done in a sharded way, but it is still not as practical as just loading the model weights that are needed on demand from disk; still, a distributed job queue being processed may be beneficial depending on the throughput and costs required by researchers.
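To sketch what "loading the weights that are needed on demand" can look like in practice (a toy example with placeholder layer types and file names, not how AlphaFold or Petals actually do it):

    import torch

    def streamed_forward(x, layer_files):
        """Run x through a stack of layers whose weights live on disk, one shard at a time."""
        for path in layer_files:
            layer = torch.nn.Linear(x.shape[-1], x.shape[-1])  # placeholder layer type
            layer.load_state_dict(torch.load(path, map_location="cpu"))
            with torch.no_grad():
                x = torch.relu(layer(x))
            del layer  # drop this shard's weights before loading the next one
        return x

    # Usage: layer_0.pt ... layer_23.pt would each hold one layer's state_dict.
    # x = streamed_forward(torch.randn(1, 512), [f"layer_{i}.pt" for i in range(24)])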
I think volunteer folding will always need to compete with crypto mining, and the appeal may be further diminished if commercial interests are likely to reap the rewards.
Perhaps if there were a way to combine mining and folding, to allow participants to somehow gain a share of the output? E.g. each folded protein would have a unique hash, which could then be traded?
And yes, I hate everything about what I just typed.
I tend to trust Semantic Scholar more than Google Scholar (GS). GS tends to overestimate. For example, on GS I have 164 citations on one paper, Semantic Scholar says 150, and FWIW Scite says 49.[1]
[1] I'll note that this paper is an arXiv paper and has not been accepted at a conference, but I'd also argue that conference acceptance means little in ML. I'll explain if anyone is actually concerned with the claim.
First, peer review via conferences/journals/etc. is relatively new in the scientific process; it has only been the main way of publishing for roughly the last 50 years. Before that, scientists simply published in the open and peer review happened through peers reading and responding. Not too different from what we see with arXiv, Twitter, and blogging.
Second, we need to talk about how good the review process actually is. There's been a lot of writing on this; the NeurIPS experiments [0] are the most famous example. But the Google paper [1] notes that reviewers are "good at identifying bad papers but not good at identifying good papers." I'll go a step further than them and suggest a plausible model that makes this statement true: reviewers are reject-happy. We'd need a confusion matrix to really see this, but if you rejected every paper you'd have a 100% success rate at rejecting bad papers and a 0% success rate at accepting good papers. We have a good demonstration that ML conferences (in ML, conferences rather than journals are the main publication venue, which is an oddity compared to other academic areas) are an extremely noisy process and not very meaningful.
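To make that reject-happy point concrete, a tiny back-of-the-envelope calculation (the submission counts are made up):

    def reviewer_stats(n_good, n_bad, accepted_good, accepted_bad):
        """Return (share of bad papers correctly rejected, share of good papers accepted)."""
        bad_reject_rate = (n_bad - accepted_bad) / n_bad
        good_accept_rate = accepted_good / n_good
        return bad_reject_rate, good_accept_rate

    # A reviewer who rejects all of 80 good and 20 bad submissions:
    print(reviewer_stats(n_good=80, n_bad=20, accepted_good=0, accepted_bad=0))
    # -> (1.0, 0.0): 100% of bad papers rejected, but 0% of good papers accepted.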
So how do we capture a signal in this noisy process? Citations are at least some signal. Obviously this isn't a fantastic signal either, because big labs and companies are able to popularize their work more and thus collect more citations. But this still isn't any worse than where we were 100 years ago. I'd argue that the noisy process of conferencing is worse than where we were 100 years ago (democratization of science aside).
Unfortunately, the only way to identify whether a paper is good is to have experts evaluate it. I don't think we have a good alternative for this, and adding significantly noisy signals isn't helpful.
Citation counts are always a bit arbitrary. Google Scholar usually overestimates, because it's basically a bunch of heuristics. Curated citation databases underestimate in the name of consistency. For example, they may ignore citations in conference proceedings, as conference papers are not considered legitimate publications in most fields.
We used the Google Scholar counts, and in particular a snapshot from the end of February, for this ranking. There is no perfect number, but at least this one is generally accepted as reasonable, and it is public, so it is easy for everyone to verify.
Of course interesting things happen outside of mainstream deep learning. The problem is that sorting by citation count is a terrible way to look broadly at research, because mainstream deep learning is so hyped up that there are just way more people working in that area than in other areas, and with more people come more citations.
It's a shame because I think that other areas offer much more interesting and technical questions, but many new researchers are only exposed to mainstream deep learning because the enormous hype drowns out everything else.
To be clear: I think the hype around deep learning is 100% justified (in fact I think it is underplayed).
But in terms of groundbreaking work, I do think "Zero-Knowledge Proofs for Machine Learning"[1] from 2020 has the potential to unlock some really revolutionary applications, in the sense of "things that were not really possible without it".
There have been strong deep-learning contributions to control, but I wouldn't say control is dominated by deep learning at this point. For instance, take a look at ICRA 2022 awards for "Outstanding Dynamics and Control" and "Outstanding Locomotion".
The most interesting AI thing happening outside deep learning is zero-knowledge techniques.
They are still very nascent, though, and it isn't surprising that they aren't yet highly cited, since it's quite an achievement to get anything working at all.
It lets you organize and share your paper collections, get personalized reading recommendations based on them, read and annotate the PDFs, and ask questions about papers using GPT.
Probably the only subfield of Computer Science, maybe even of academic research in general, where citation count is the poorest metric of paper quality. The whole situation is more akin to mass media repeatedly reporting the same breakthrough in cancer-cure research (in mice).
According to Fig. 3, Google was the most-cited problem domain. The most-cited papers are for Google problems. You need Google money and compute to produce top-cited research.
Not sure why this is being downvoted; this is my takeaway too. The cutting edge of AI requires extreme computing resources, the likes of which only a few universities and private companies have, and which don't really exist outside of them. Obviously not all cutting-edge research requires petaflops of compute, but much of what's happening now does, and that means the few players with those kinds of resources will dictate what research gets done.
Citations are of course only a distant proxy for real world impact. What would you recommend as a better early indicator? We also looked at Tweet count, but that seems more biased towards marketing and hype, and in fact does not even correlate very well with citation count.
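(That kind of check is just a correlation over the two per-paper counts; a minimal sketch with invented numbers, not the actual data behind the ranking:)

    import numpy as np

    # Hypothetical per-paper counts, purely for illustration.
    citations = np.array([1200, 800, 450, 300, 150, 90, 40])
    tweets    = np.array([  30, 900,  10, 400,   5, 700,  2])

    # Pearson correlation between the two counts; a low value would support
    # the observation that tweet counts track hype more than citations.
    print(np.corrcoef(citations, tweets)[0, 1])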
Also, the post conflates citations with impact. Citations are a measure of popularity, but impact is subjective. Which has more impact: Stable Diffusion or research that can better diagnose cancer? That could be argued either way.