What's next for AlphaFold and the AI protein-folding revolution

fabian2k · on April 13, 2022

I found it interesting that AlphaFold can't reliably predict the structures for mutations that disrupt structure. The explanation makes a lot of sense though.

It is sometimes important to remind oneself that the selection of protein structures that exist in nature and that we determined experimentally is biased. Nature doesn't like proteins that misfold because they can easily cause trouble. And proteins with less defined structures are generally harder to solve with the usual methods like X-ray crystallography. The list of protein structures we know isn't a representative sample of all possible protein structures, it's mostly structures that are useful in nature and that we can determine with the methods we have available.

alan-hn · on April 13, 2022

>proteins with less defined structures are generally harder to solve with the usual methods like X-ray crystallography

What do you mean by 'proteins with less defined structures'? I'm not familiar with what this phrase could mean, could you please expand on this concept?

fabian2k · on April 13, 2022

Less defined means flexible in this case. So either parts that are completely random on their own, or parts that can adopt multiple different structures.

There are also intrinsically disordered proteins that have no defined structure when they are on their own, that's essentially like a piece of string that is almost completely flexible. Those proteins can still adopt a specific well-defined structure if they bind to something else.

alan-hn · on April 13, 2022

So does flexible mean that there may be different amino acids in a portion of the peptide? From my understanding, when flexibility is discussed in terms of proteins we're talking about rigid vs flexible side chains which can move or rotate along specific bonds

So for the intrinsically disordered ones, are you mainly talking about the secondary or tertiary structures? My assumption based on your statement is that we're keeping the same primary structure (order of amino acids) but they don't have many (if any at all) intermolecular interactions? Would it be safe to assume that you're referring to shorter polypeptide rather than large proteins?

dekhn · on April 13, 2022

in disordered proteins, there is no permanent tertiary structure. they may have some secondary structure, but the relations of those structural elements can change in time. It does not mean the sequence has variation in it.

alan-hn · on April 13, 2022

Does this mean that they have multiple conformational states with similar energies that are easy for it to transition between? How different are the states and is this how the protein normally does its proteiny stuff?

f38zf5vdt · on April 13, 2022

Yes, many proteins have transitional global arrangements that they traverse as they meet some goal. For example, kinesin and dynein walk along microtubules in a way where we could never perfectly characterize the intermediary states since it's effectively a motor with free rotation around certain elements.

A lot of crystallography is focused on enzymatic reactions where you bind a ligand that sits there for the sake of introducing some conformation that you can study. The ligands generally approximate the natural substrate at either the beginning, end, or some intermediate step in enzyme catalyzed synthesis.

dekhn · on April 13, 2022

yes, I would say that intrinsically disordered proteins adopt something like an unfolded state, which is to say that they can visit a wide range of structures that are at similar energy levels, all of which are accessible at ~room temp. I can't really answer in more detail because all the ID proteins are fairly different and how they do their job is hard to understand compared to stable static "rocks" like enzymes.
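To put rough numbers on "similar energy levels, all accessible at room temp", here is a minimal sketch using Boltzmann weights. The energies are made up for illustration, and 298 K is assumed for room temperature; real proteins have vastly more states than four.

```python
import math

# Gas constant in kcal/(mol*K); room temperature assumed to be 298 K.
R_KCAL = 0.0019872041
T_ROOM = 298.0


def boltzmann_populations(energies_kcal_per_mol, temperature=T_ROOM):
    """Return the equilibrium fraction of molecules in each state."""
    kt = R_KCAL * temperature  # roughly 0.59 kcal/mol at 298 K
    e_min = min(energies_kcal_per_mol)
    # Shift by the lowest energy so the exponentials stay well-behaved.
    weights = [math.exp(-(e - e_min) / kt) for e in energies_kcal_per_mol]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical energies (kcal/mol), chosen only for illustration.
disordered_like = [0.0, 0.2, 0.3, 0.5]  # several states within about kT of each other
folded_like = [0.0, 5.0, 6.0, 7.0]      # one state far below the rest

print(boltzmann_populations(disordered_like))
print(boltzmann_populations(folded_like))
```

With the closely spaced energies no single state dominates, while with a gap of several kcal/mol essentially all of the population sits in the lowest state, which is the "spaghetti" versus "rock" picture in miniature.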

throwawaybio3 · on April 13, 2022

Enzymes aren't stable and static -- usually in their active site they have significant conformational changes that enable catalysis of the relevant chemical reaction. It's quite a problem that we don't have general robust ways of directly elucidating those transient structures, a lot of our understanding of catalysis is still held back or slow-evolving because we can only use indirect and cumbersome methods (like isotopic mutation + laser IR)

I would consider most enzymes to be intrinsically disordered at their active sites.

dekhn · on April 13, 2022

No, enzymes are not intrinsically disordered at their active sites. They are highly ordered. Most enzymes don't undergo large changes- they accept a molecule, do their business, and release it. You're thinking of other proteins like motor proteins which undergo large, controlled conformational changes.

The active site is structured to stabilize the transition state of the affected molecule and move it from one state to the next in the chemical reaction. That requires very specific shapes and correlated changes. But of course, this being biology, you can remove all 3 active site residues in a serine protease catalytic triad, and still see proteolysis because the protein, when it binds the substrate, forces the substrate into its transition pathway.

People have been working on these things for quite some time- I saw talks about time-resolved crystallography of active sites, and while they say "significant structure changes", they really only mean localized breathing-like motions, not massive rotations of entire domains.

panabee · on April 13, 2022

is it possible to identify which proteins are intrinsically disordered based on amino acid sequence alone (or even base sequence)?

put another way, is it possible to a priori determine if a protein is ID or ordered?

for instance, you said enzymes are highly ordered. is this based on experimental observations (which could later be wrong if imaging techniques improve) or is there some principle that allows us to treat this as a fact?

thanks in advance for your time.

flobosg · on April 13, 2022

> is it possible to a priori determine if a protein is ID or ordered?

There’s software that attempts to predict intrinsic disorder based on sequence alone, but in general, in the absence of homolog (evolutionarily related) proteins with known structure you would still need to check experimentally for disorder.
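As a rough illustration of what sequence-only disorder prediction can look like, here is a minimal sketch of the classic charge-hydropathy heuristic: disordered proteins tend to combine low mean hydrophobicity with high net charge. The function names and toy sequences are hypothetical, and the boundary constants are the commonly quoted ones for this heuristic, so treat them as an assumption. Real predictors are far more sophisticated.

```python
# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq):
    """Mean Kyte-Doolittle hydropathy, rescaled to the 0..1 range."""
    return sum((KD[aa] + 4.5) / 9.0 for aa in seq) / len(seq)


def mean_net_charge(seq):
    """Absolute mean net charge: K/R count as +1, D/E as -1, His is ignored."""
    charge = sum(1 for aa in seq if aa in "KR") - sum(1 for aa in seq if aa in "DE")
    return abs(charge) / len(seq)


def looks_disordered(seq):
    """True if the sequence falls on the 'disordered' side of the boundary line."""
    seq = seq.upper()
    hydropathy = mean_hydropathy(seq)
    charge = mean_net_charge(seq)
    # Assumed boundary: <H> = (<R> + 1.151) / 2.785
    boundary = (charge + 1.151) / 2.785
    return hydropathy < boundary


# Hypothetical toy sequences, not real proteins.
print(looks_disordered("EEKKSSPPEEKKSSPPGGEE"))  # charged, polar
print(looks_disordered("LLVVIIAALLVVIIAAFFLL"))  # strongly hydrophobic
```

A whole-sequence score like this says nothing about which regions are disordered, and it will misclassify plenty of proteins, which is why an experimental check is still needed.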

EDIT:

> if the goal is to reliably assess certain viral proteins as ID or ordered, experimental methods are the only methods for achieving this?

If you don’t find homologs with solved structures, experimental characterization is the way to go.

panabee · on April 13, 2022

thanks for the explanation. to clarify, if the goal is to reliably assess certain viral proteins as ID or ordered, are experimental methods the only methods for achieving this?

dekhn · on April 13, 2022

A priori? No. Typically this would be determined by synthesizing or expressing the protein of interest and then using something like CD (circular dichroism).

There is an absolutely enormous amount of experimental data about enzyme structure, but frankly I think the simplest is to just understand that the modern ideas about the reversible protein folding process came from ribonuclease, a protein that cuts RNA: https://en.wikipedia.org/wiki/Anfinsen%27s_dogma

There may also be intrinsically disordered enzymes, I'm not really sure how they would work, but of course, in biology, there's always a weird example that violates normal expectations because evolution once randomly tried something a billion years ago and got stuck with it.

panabee · on April 13, 2022

thanks for the clarification. your papers also seem interesting, will check those out.

the goal is to reliably characterize certain viral proteins as ID or ordered. would you happen to have any advice on this?

alan-hn · on April 14, 2022

Out of curiosity, what's the relevance of the classification in the context you present?

panabee · on April 14, 2022

if viral proteins are ordered, it becomes much easier to predict their functionality.

alan-hn · on April 14, 2022

So would the predictions be based entirely on the primary structure or assumed primary structure based on nucleotide sequence?

Sounds to me like you're doing some bioinformatics stuff to understand how some virus works, or am I totally off base on that guess?

panabee · on April 14, 2022

correct. if you downvoted the previous comment, would you mind sharing why?

if ordered proteins are not in fact easier to understand than ID ones, it would be helpful to understand why.

thanks for the feedback.

panabee · on April 14, 2022

instead of only downvoting, could someone also clarify how this statement is wrong to help educate myself on the facts? thanks for your help.

paraschopra · on April 16, 2022

I didn't downvote you but going from a structure prediction (which you get for ordered proteins) to functionality is not straightforward at all. How would you do it?

You could predict functionality based on homology with evolutionarily related sequences, but that would work equally well for all proteins (ordered or not).

panabee · on April 17, 2022

thanks for replying.

to clarify, the statement isn't that it is straightforward -- simply that it's easier to analyze something that doesn't change shape unpredictably than something that does (fewer variables to consider).

whether predictions based on evolutionary related sequences are 100% accurate in human cells is another matter, especially when it comes to viruses.

jostmey · on April 13, 2022

The same protein can deform into multiple different 3D shapes, called conformations. Some proteins are rigid and exist almost exclusively in a single conformation. It is probably easier to determine the 3D structure of proteins with a single, dominant conformation. Other proteins don't have well defined conformations, and are more like a tangle of rope that can bend in many different ways

panabee · on April 13, 2022

thanks for the explanation. what are the biggest factors influencing conformation? what are the best ways today for imaging proteins with different conformations, and what are the limitations of these methods?

tintor · on April 13, 2022

Example by analogy: Flat tire has less defined structure, and can take many shapes. Inflated tire has more defined structure, and behaves more predictably.

abcc8 · on April 13, 2022

Many proteins have intrinsically disordered regions that are hypothesized to be directly related to the protein's role in the cell. These regions are termed disordered because current methods used to determine the structure of proteins are unable to resolve a regular structure for these regions in the context of a protein crystal or protein in solution. This publication is an informative review on the topic: https://pubs.acs.org/doi/10.1021/cr400525m

dekhn · on April 13, 2022

think loose floppy piles of spaghetti instead of well-defined rocks.

flobosg · on April 13, 2022

> I found it interesting that AlphaFold can't reliably predict the structures for mutations that disrupt structure

It’s not that surprising given the conceptual background of the method. Since it’s relying on evolutionarily coupled residues, AlphaFold is looking at sets of complementary mutations that keep or rescue a determined structure, i.e. the complete opposite of structural disruption.
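To make "evolutionarily coupled residues" concrete, here is a minimal sketch of the older covariation idea: measure how strongly two alignment columns vary together using mutual information. This is not how AlphaFold works internally (it learns from the alignment with a neural network); the toy alignment and function names are hypothetical.

```python
import math
from collections import Counter


def column(alignment, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    col_i, col_j = column(alignment, i), column(alignment, j)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment. Columns 0 and 2 co-vary (K pairs with E,
# E pairs with K), column 1 is fully conserved, and column 3 varies
# independently of column 0.
toy_msa = [
    "KAEG",
    "KAES",
    "EAKG",
    "EAKS",
]

print(mutual_information(toy_msa, 0, 2))  # coupled pair
print(mutual_information(toy_msa, 0, 1))  # conserved column carries no signal
print(mutual_information(toy_msa, 0, 3))  # independent variation
```

The coupled pair is the signature of compensating mutations that preserve a contact. A single disruptive mutation leaves no such trace in the alignment, which is the point made above.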

> The list of protein structures we know isn't a representative sample of all possible protein structures

And the same goes for protein sequences.

axg11 · on April 13, 2022

Very nicely explained. Also hints at the next big frontier for protein folding: improving the prediction of those disruptive effects.

photochemsyn · on April 13, 2022

It's kind of surprising that AlphaFold has some success with random sequences of amino acids:

> "Baker’s team gets AlphaFold and RoseTTAFold to “hallucinate” new proteins. The researchers have altered the AI code so that, given random sequences of amino acids, the software will optimize them until they resemble something that the neural networks recognize as a protein. In December 2021, Baker and his colleagues reported expressing 129 of these hallucinated proteins in bacteria, and found that about one-fifth of them folded into something resembling their predicted shape."

20% is not that great but it has potential. One long-standing goal is the de novo design of protein-based industrial catalysts for specific chemical transformations. Proteins from bacteria that live in boiling sulfur vents etc. have been used to some extent, but the idea is that similar proteins could be designed for a much wider variety of industrial processes. As the article notes, specificity remains a challenge (and designed proteins don't approach the efficiency of the evolutionary selected proteins), but it still seems promising.

P.S. I'm a bit more skeptical about the drug-design programs. It's not so much that novel drugs can't be designed that bind to the desired targets, it's that they might bind to a whole lot of undesired targets as well, leading to nasty side effects. Now if you could screen against the whole proteome, perhaps.

jimmySixDOF · on April 14, 2022

Talking about higher level applications, BBC Science in Action [1] interviewed Prof John McGeehan of the Centre for Enzyme Innovation at Portsmouth University working on bacteria breaking down plastic in landfills. He explained his workflow of maybe selecting one candidate occasionally out of many due to the cost/time involved & how DeepMind gave him more results in one weekend than he had expected to see over his entire career.

[1] https://www.bbc.co.uk/sounds/play/w3ct1l3y

flobosg · on April 13, 2022

> 20% is not that great but it has potential.

20% success rate is in line with other protein design methods, though.

gfodor · on April 13, 2022

I'd imagine the success rate isn't apples to apples - the real measure is "time, energy, and manpower expenditure needed per generated protein"

flobosg · on April 13, 2022

Both measures can be quite similar. Most protein designs can be screened in parallel for solubility and successful designs can be further engineered and tested in a high-throughput manner.

Teever · on April 14, 2022

> and designed proteins don't approach the efficiency of the evolutionary selected proteins

This sounds interesting. Can you talk more about this? Efficiency in what sense, energy efficiency in the actual protein assembly in biological systems, or efficiency in actual performance of the protein while it's functioning in a biological system?

flobosg · on April 14, 2022

He’s probably referring to the catalytic efficiency of designed enzymes, so your second option.

dekhn · on April 13, 2022

I just wish people would stop using the word "fold" for this. It's not folding. It's just structure prediction. It's great at structure prediction (static prediction of a single structure) and not at all at the folding process (which is dynamic and rapidly changing).

flobosg · on April 13, 2022

“Protein fold” and “protein folding” are two different concepts. Folds are structural categories, folding is the biophysical process. But I agree that there are better words out there to name such a tool.

dekhn · on April 13, 2022

That's very misleading, as you can see. I believe we should not use the term fold for structural categories as it's a misnaming. It's a historical accident that came about before people began to understand that folding is a process, not an on/off switch.

See my work in this area: https://pubmed.ncbi.nlm.nih.gov/24345941/ which is explicitly attempting to simulate an approximate folding pathway(s).

flobosg · on April 13, 2022

There are other terms analogous to “fold”, like “topology” (as used in CATH), but they will probably never see widespread use.

gilleain · on April 13, 2022

Even 'topology' is a little confusing to those more familiar with the term from maths.

For the 'CATH' hierarchical classification, the 'Topology' level is something like the organization of secondary structure in an 'Architecture'. This has some relationship to topology in the general sense, but is a narrower definition.

For me, the 'fold' is what happens after 'folding' occurs, but I take the point that it is confusing.

dekhn · on April 13, 2022

Topology actually makes some sense here in that a very small number of proteins do fold into knots! This was a huge surprise and completely contradicted most predictions. https://en.wikipedia.org/wiki/Knotted_protein

gilleain · on April 14, 2022

Heh, yes I'm fairly familiar with knotted proteins, and Willie Taylor's work on them.

Strictly speaking most of them are not mathematical knots either, since the ends are not connected. Some have salt bridges, I seem to recall.

dekhn · on April 14, 2022

knots in the real world (actual, real knots) don't have tied-together ends either. That's not what a knot is. The study of topology in math works on abstract math-knots.

Generally, any non-covalent bonds aren't considered to be topological connections in proteins, although salt bridges can definitely have bond energies not far from covalent bonds.

flobosg · on April 13, 2022

If I recall correctly, the difference between Architecture and Topology in CATH is that the former is independent of connectivity.

> For me, the 'fold' is what happens after 'folding' occurs, but I take the point that it is confusing.

Same here.

daveguy · on April 13, 2022

The most accurate term would probably be "tertiary structure". Although AlphaTertiaryStructure is a mouthful. They could have named it AlphaTS.

flobosg · on April 13, 2022

> The most accurate term would probably be "tertiary structure".

There’s an AlphaFold variant that can predict quaternary structure: https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2

> They could have named it AlphaTS.

TS looks more like “transition state” to me.

daveguy · on April 13, 2022

Ah, good point. TS would be as bad or worse than "Fold" in this context.

dekhn · on April 13, 2022

DeepMindProteinStructurePredictor or deep_mind_protein_structure_predictor if you don't like camel case

dekhn · on April 13, 2022

Yup. I've had this discussion repeatedly with the developers of SCOP and the folks who run CASP and they simply will not budge.

codeflo · on April 13, 2022

I think this article does a good job of highlighting the difference between simulations and ML-based approaches. The latter are faster, but have limitations outside of their training parameters. As with everything in ML, broader training data to cover those cases probably helps. Though I would guess some of the problems could be inherent, that there fundamentally is no computational shortcut to this problem, whether you use a neural network or not.

xnx · on April 13, 2022

Facebook: Releases a tool that makes amusing image mashups

Google: Makes revolutionary progress in one of the hardest problems in chemistry

dekhn · on April 13, 2022

This wasn't Google, it was DeepMind. Google doesn't get any credit for this. I tried to start this project at Google but it conflicted with the Google Health team's goals.

xiphias2 · on April 13, 2022

Even if it’s a sister project, it’s great PR for Google. I accept more ads from Google as it gives back so much in healthcare. I wish Meta would do the same, I wouldn’t care if it’s part of Facebook or not. GMail was something similar at the start: just do something good, to make more people like Google.

As for your own project I’m sorry for you: there are no more 20% projects, like in the old times :(

bawolff · on April 13, 2022

It's owned by google (alphabet), i think they deserve some props for it.

dekhn · on April 13, 2022

google is owned by alphabet. DM and Google are siblings.

bawolff · on April 13, 2022

Most people say "google" to mean alphabet.

mattwest · on April 14, 2022

True, but applying that logic in order to give credit to Google is wrong, which is the point of the comment above.

smueller1234 · on April 14, 2022

If one wanted to make a case for that, arguably it's DeepMind's major infrastructure resource consumption that would give Google some credit. I can certainly point to some colleagues in infrastructure engineering who've helped DeepMind folks deal with infrastructure problems.

That being said, I'm perfectly happy to just be proud of what DeepMind has achieved. They've been great to interact with when there were shared challenges, and I - and to my knowledge all colleagues around me - aren't very interested in a question of credit assignment.

dekhn · on April 14, 2022

I agree, 98.3%.

Google's infrastructure and DeepMind's internal (not cloud) access to it has been absolutely critical. In many ways, DM is leading Google in terms of software development, while Google is leading in providing unique ML hardware capabilities.

flobosg · on April 13, 2022

To be fair, they have published a few papers and preprints related to the topic. See e.g. https://www.pnas.org/doi/10.1073/pnas.2016239118 and https://www.biorxiv.org/content/10.1101/2021.02.12.430858v3

peter303 · on April 13, 2022

I see a Nobel Prize around the corner.

They aren't often given for techniques or computation. But the results are outstanding.

mupuff1234 · on April 13, 2022

Does Deepmind sell anything? Their site has no mention of any type of offering.

benrapscallion · on April 13, 2022

They have spun out a drug design company named Isomorphic Labs. [1]

[1] https://www.isomorphiclabs.com/

dekhn · on April 13, 2022

amusingly, I work for a pharma and they don't even return our calls. I wonder how seriously they take this business, because if I was selling a product based on this, pharma would be my first customer.

alphabetting · on April 13, 2022

Could be wrong but I think Deepmind sees more value in elite AI/ML talent that Alphafold will draw and help retain than future potential profits on drug discovery. Open sourcing Alphafold and removing commercial restrictions wouldn't make much sense if drug profits were their goal.

dekhn · on April 13, 2022

No, isomorphic labs was set up to specifically commercialize this. If their goal is to be a discovery company, they are fairly naive.

alphabetting · on April 13, 2022

Yeah I know about Isomorphic labs. My point is that talent Alphafold will draw is more valuable than potential drug discovery profits.

mechagodzilla · on April 13, 2022

Where does the value come in if you pay them lots of money to work on unprofitable things? Just by virtue of not letting your competitors hire them?

alphabetting · on April 13, 2022

Profit is later. I strongly believe this take.

https://twitter.com/fchollet/status/1502775288257601540

folli · on April 13, 2022

If the promise of in silico drug design comes to fruition, the potential drug discovery profits could very well rival Google's ad profits.

dekhn · on April 13, 2022

Sure. AlphaFold is, in fact, the greatest shot at revenue that DeepMind has shown so far (and they are under intense pressure from Alphabet to show revenue).

alphabetting · on April 13, 2022

I don't think there is any pressure on that front. They are supposedly profitable now (though i'm guessing this is partially accounting tricks) but there just isn't a need to be profitable. Search and Youtube print money to fund their R&D ($31B last year alone). The goal is AGI or close to it.

https://venturebeat.com/2021/10/10/ai-lab-deepmind-becomes-p...

dekhn · on April 13, 2022

The "profit" you're pointing at is money that Google pays DeepMind to do software and machine learning as a service for them. This pays off, for example with Jax, where nobody in Google Research could touch it because Jeff Dean/Tensorflow, until DM demonstrated (with alphafold) that Jax could do nobel-prize-winning research, to the point where Jeff has admitted that tensorflow has serious problems and systems like jax are the future (see the palm paper!!!)

pkaye · on April 13, 2022

Must have taken the Google approach and already terminated the product. /s

elcomet · on April 13, 2022

Alphafold was released, the code is open source and the pretrained weights are available for free.

asdff · on April 13, 2022

I believe only for academic use though right? I don't know if it can be used for commercial use.

lucidrains · on April 13, 2022

incorrect, they modified the license so it can be used for commercial use - https://github.com/deepmind/alphafold/commit/8173117130e6df8...