Incredible how much benefit AlphaFold has brought. And all of that from a model with fewer than 100 million parameters.
I might be dumb, but could they scale it up and make an AlphaFold 3 with maybe 10 billion parameters? Would it be a lot better, assuming the same training effort is put into it?
If so, couldn't biotech companies just go nuts, build a 100-billion-parameter internal model, and have all the protein structures they want?
Has there been much research into the idea of distributing these large models across many heterogeneous machines?
I'm wondering if there could be a path towards a mix of alphafold and folding@home, with donated idle compute resources being used to train/run the models.
Designing for that sort of fragmentation could also make it easier to slowly run oversized models on local machines with swapped memory.
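To make the "swapped memory" idea concrete, here's roughly what that looks like today with Hugging Face Accelerate's disk offload (the model id and offload folder are placeholders, and this is a sketch rather than a recommendation):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "your/oversized-model"  # placeholder checkpoint id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",          # spread layers across GPU/CPU as space allows
        offload_folder="./offload", # spill whatever doesn't fit to disk ("swapped memory")
    )

    inputs = tokenizer("Example prompt", return_tensors="pt")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))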
There was great work last year on distributed training and inference:
Petals: Collaborative Inference and Fine-tuning of Large Models
Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, Colin Raffel
https://arxiv.org/abs/2209.01188
Would be wonderful to use this to host this model: LLaMA: Open and Efficient Foundation Language Models https://arxiv.org/abs/2302.13971
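For reference, the client side of Petals is only a few lines. This is a rough sketch based on the project's README around that time; the exact class name and checkpoint id are assumptions and may have changed since:

    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM  # class name as of the README, may differ now

    model_id = "bigscience/bloom-petals"  # example swarm-hosted checkpoint, may differ today
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoDistributedModelForCausalLM.from_pretrained(model_id)  # joins the public swarm

    inputs = tokenizer("A protein fold is", return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=16)  # each step runs across volunteer nodes
    print(tokenizer.decode(outputs[0]))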
With any current architecture it is infeasible to do backprop (training) due to the massive communication requirements. Inference can be done in a sharded way, but it is still not as practical as just loading the model weights that are needed on demand from disk; still, a distributed job queue being processed may be beneficial depending on the throughput and costs required by researchers.
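To sketch what "loading the weights that are needed on demand" can look like in practice (a toy example with placeholder layer types and file names, not how AlphaFold or Petals actually do it):

    import torch

    def streamed_forward(x, layer_files):
        """Run x through a stack of layers whose weights live on disk, one shard at a time."""
        for path in layer_files:
            layer = torch.nn.Linear(x.shape[-1], x.shape[-1])  # placeholder layer type
            layer.load_state_dict(torch.load(path, map_location="cpu"))
            with torch.no_grad():
                x = torch.relu(layer(x))
            del layer  # drop this shard's weights before loading the next one
        return x

    # Usage: layer_0.pt ... layer_23.pt would each hold one layer's state_dict.
    # x = streamed_forward(torch.randn(1, 512), [f"layer_{i}.pt" for i in range(24)])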
I think volunteer folding will always need to compete with crypto mining, and the appeal may be further diminished if commercial interests are likely to reap the rewards.
Perhaps if there were a way to combine mining and folding, to allow participants to somehow gain a share of the output? E.g. each folded protein would have a unique hash, which could then be traded?
And yes, I hate everything about what I just typed.
I tend to trust Semantic Scholar more than Google Scholar (GS). GS tends to overestimate. For example, on GS I have 164 citations on one paper, Semantic Scholar says 150, and FWIW Scite says 49.[1]
[1] I'll note that this paper is an arXiv paper and has not been accepted at a conference, but I'd also argue that conference acceptance means little in ML. I'll explain if anyone is actually concerned with the claim.
First, peer review via conferences/journals/etc. is relatively new in the scientific process; it has only been the main way of publishing for roughly the last 50 years. Before that, scientists simply published in the open and peer review happened through peers reading and responding. Not too different from what we see with arXiv, Twitter, and blogging.
Second, we need to talk about how good the review process actually is. There's been a lot of writing on this; the NeurIPS experiments [0] are the most famous example. But the Google paper [1] notes that reviewers are "good at identifying bad papers but not good at identifying good papers." I'll go a step further than them and suggest a plausible model that makes this statement true: reviewers are reject-happy. We'd need a confusion matrix to really see this, but if you rejected every paper you'd have a 100% success rate at rejecting bad papers and a 0% success rate at accepting good papers. We have a good demonstration that ML conferences (in ML, conferences rather than journals are the main publication venue, which is an oddity compared to other academic areas) are an extremely noisy process and not very meaningful.
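To make that reject-happy point concrete, a tiny back-of-the-envelope calculation (the submission counts are made up):

    def reviewer_stats(n_good, n_bad, accepted_good, accepted_bad):
        """Return (share of bad papers correctly rejected, share of good papers accepted)."""
        bad_reject_rate = (n_bad - accepted_bad) / n_bad
        good_accept_rate = accepted_good / n_good
        return bad_reject_rate, good_accept_rate

    # A reviewer who rejects all of 80 good and 20 bad submissions:
    print(reviewer_stats(n_good=80, n_bad=20, accepted_good=0, accepted_bad=0))
    # -> (1.0, 0.0): 100% of bad papers rejected, but 0% of good papers accepted.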
So how do we capture a signal in this noisy process? Citations are at least some signal. Obviously this isn't a fantastic signal either, because big labs and companies are able to popularize their work more and thus collect more citations. But this still isn't any worse than where we were 100 years ago. I'd argue that the noisy process of conferencing is worse than where we were 100 years ago (democratization of science aside).
Unfortunately, the only way to identify whether a paper is good is to have experts evaluate it. I don't think we have a good alternative for this, and adding significantly noisy signals isn't helpful.
Citation counts are always a bit arbitrary. Google Scholar usually overestimates, because it's basically a bunch of heuristics. Curated citation databases underestimate in the name of consistency. For example, they may ignore citations in conference proceedings, as conference papers are not considered legitimate publications in most fields.
We used the Google Scholar counts, and in particular a snapshot from the end of February, for this ranking. There is no perfect number, but at least this one is generally accepted as reasonable, and it is public, so it is easy for everyone to verify.
Of course interesting things happen outside of mainstream deep learning. The problem is that sorting by citation count is a terrible way to look broadly at research, because mainstream deep learning is so hyped up that there are just way more people working in that area than in other areas, and with more people come more citations.
It's a shame because I think that other areas offer much more interesting and technical questions, but many new researchers are only exposed to mainstream deep learning because the enormous hype drowns out everything else.
To be clear: I think the hype around deep learning is 100% justified (in fact I think it is underplayed).
But in terms of groundbreaking work, I do think "Zero-Knowledge Proofs for Machine Learning"[1] from 2020 has the potential to unlock some really revolutionary applications, in the sense of "things that were not really possible without it".
There have been strong deep-learning contributions to control, but I wouldn't say control is dominated by deep learning at this point. For instance, take a look at ICRA 2022 awards for "Outstanding Dynamics and Control" and "Outstanding Locomotion".
The most interesting AI thing happening outside deep learning is zero-knowledge techniques.
They are still very nascent, though, and it isn't surprising that they aren't yet highly cited, since it's quite an achievement to get anything working at all.
It lets you organize and share your paper collections, get personalized reading recommendations based on them, read and annotate the PDFs, and ask questions about papers using GPT.
Probably the only subfield of Computer Science, maybe even of academic research in general, where citation count is the poorest metric of paper quality. The whole situation is more akin to mass media repeatedly reporting the same breakthrough in cancer-cure research (in mice).
According to Fig. 3, Google was the most-cited problem domain. The most-cited papers are for Google problems. You need Google money and compute to produce top-cited research.
Not sure why this is being downvoted; this is my takeaway too. The cutting edge of AI requires extreme computing resources, the likes of which only a few universities and private companies have, and which don't really exist outside of them. Obviously not all cutting-edge research requires petaflops of compute, but much of what's happening now does, and that means the few players with those kinds of resources will dictate what research gets done.
Citations are of course only a distant proxy for real world impact. What would you recommend as a better early indicator? We also looked at Tweet count, but that seems more biased towards marketing and hype, and in fact does not even correlate very well with citation count.
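(That kind of check is just a correlation over the two per-paper counts; a minimal sketch with invented numbers, not the actual data behind the ranking:)

    import numpy as np

    # Hypothetical per-paper counts, purely for illustration.
    citations = np.array([1200, 800, 450, 300, 150, 90, 40])
    tweets    = np.array([  30, 900,  10, 400,   5, 700,  2])

    # Pearson correlation between the two counts; a low value would support
    # the observation that tweet counts track hype more than citations.
    print(np.corrcoef(citations, tweets)[0, 1])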
Also, the post conflates citations with impact. Citations are a measure of popularity, but impact is subjective. Which has more impact: Stable Diffusion or research that can better diagnose cancer? That could be argued either way.