
AlphaFold - The Single Most Important AI Breakthrough

By Two Minute Papers

Summary

## Key takeaways

- **Success Felt Suspiciously Easy**: It almost felt too easy. Too many ideas were working, performance kept going up, so they suspected leaking the test set and double-checked, but found no leak. [05:50], [06:11]
- **Proteins: Nanomachines from DNA**: Proteins are nanomachines with a couple thousand atoms each, coded by DNA where three letters map to one of 20 chemical groups, built into a chain that folds into a compact 3D working machine. [01:23], [02:30]
- **Year-Long Experiments to Minutes**: It takes a year and $100,000 for hard experiments using enormous synchrotrons to get a protein structure, but AlphaFold predicts it in five or 10 minutes with accuracy close to experimental. [03:16], [04:27]
- **AlphaFold Predicts Disorder Too**: AlphaFold showed ugly long arcing ribbons on some proteins with low confidence, which matched experimentally known disordered regions, making its lowest confidence a state-of-the-art disorder predictor. [12:49], [13:14]
- **Fertilization Protein via 2000 Runs**: Two labs ran AlphaFold on egg protein against 2000 sperm surface proteins, found one that stuck, confirmed in lab by knockout and mutations blocking fertilization. [16:03], [16:43]
- **Dominant in Protein Design**: Even though not mutation sensitive, AlphaFold filtering became a secret of modern protein design, giving a 10-fold increase in success rate for designs that bind to each other. [18:01], [18:31]

Topics Covered

  • Success Felt Suspiciously Easy
  • Progress Alternates Terror-Elation
  • AI Predicts Unseen Protein Complexes
  • AlphaFold Discovers Fertilization Protein
  • AlphaFold Underpins Modern Biology

Full Transcript

Favorite Two Minute Papers episode?

>> Oh, AlphaFold. It's easy.

It almost felt too easy. It felt like too many ideas were working. It felt like it was going up, and I remember talking to Tim, the engineering lead, going, "This is really feeling too easy. We're having too much success. This problem can't be this easy. Are we leaking the test set?" Right? You know, are we doing the classic machine learning sin?

Fellow scholars, I don't really like to be on camera, but there is a big reason I am here today.

You see, I met Nobel Prize-winning chemist John Jumper last year, and we talked for an hour. And in that hour, I learned more than I thought I would learn in a year. It was unbelievable. And today, I have the opportunity to give you this amazing gift, too. So, with that said, hey, John.

>> Hello.

>> I'm really, really grateful to have you here today. I have goosebumps, which I have carefully hidden under this lab coat.

>> So what is AlphaFold and why is it important?

>> So AlphaFold is a neural network, which makes it relatively appropriate for the podcast, but it is a deep learning system that predicts the result of a specific scientific experiment. And to tell you about that, I should tell you about the domain that it's in.

Proteins. So proteins are the nanomachines that basically drive your cell. A couple thousand atoms each. They're coded for by your DNA. When we say that DNA is an instruction manual for the cell, a lot of what it's telling you is how and when to build proteins. And so three letters of your DNA map to an individual one of 20 chemical groups.

Those chemical groups are basically just like little collections of atoms, you know, boop boop boop boop boop boop. And there's a machine, another protein in the body, that reads the DNA in a relatively complicated process and kind of builds out the proteins one step at a time, joining links in a chain or a rope. So it takes this chemical group, attaches that one, attaches that one, attaches that one, attaches it basically the same way each time, and builds out a string of these; maybe 300 is a reasonably typical length. And then what happens when your cell builds this thing is, of course, it's not a machine. Most of them are not machines that function kind of just as floppy ropes. Some parts of it are greasy. Some parts of it are positively charged. Some parts are negative. So it will fold up.

It will make helices. It will make sheets. It will pack into a relatively compact 3D object that is kind of the assembled working machine. So these are machines that build themselves, right, but are joined in 1D. And of course our DNA is 1D and our world is 3D. So this is kind of how the body solves this. It builds these things. They fold up into this incredibly intricate shape. And this happens, you know, there are about 20,000 human proteins. There are hundreds of millions, billions of known proteins across all organisms. And one part of what I described is really, really easy to measure. It's really easy to read our DNA thanks to the genomics revolution.
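The "three letters of DNA map to one of 20 chemical groups" step described above can be sketched as a simple lookup. This is a minimal illustration only: the table is deliberately partial (a few entries from the standard genetic code), and the function name `translate` and the example sequence are assumptions made for this sketch, not anything from the interview.

```python
# Partial codon table from the standard genetic code (DNA letters, coding strand).
# Only a few of the 64 codons are listed; a real table covers all of them.
CODON_TABLE = {
    "ATG": "M",  # methionine (also the usual start signal)
    "GCT": "A",  # alanine
    "GAT": "D",  # aspartic acid
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TTT": "F",  # phenylalanine
    "TAA": "*",  # stop
}


def translate(dna: str) -> str:
    """Read DNA three letters at a time and chain the amino acids together."""
    chain = []
    usable_length = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        codon = dna[i:i + 3].upper()
        amino_acid = CODON_TABLE.get(codon, "?")  # "?" = codon missing from this partial table
        if amino_acid == "*":  # a stop codon ends the chain
            break
        chain.append(amino_acid)
    return "".join(chain)


print(translate("ATGGCTGATAAATGGTTTTAA"))  # prints MADKWF
```

The output is the one-dimensional chain of amino acids; the hard part discussed in the rest of the conversation is going from that string to a 3D shape.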

You can think of it as pennies to read the sequence of a protein, the DNA that becomes the linked amino acids. It takes a year to get the structure of a protein, with really hard experiments, and they often fail; it's just extraordinarily difficult. If you want to put an economic value on it, maybe $100,000. So scientists do this experiment where they start from DNA, but they really want to understand how this machine works. So they need to see a picture of it. And so they determine the structure experimentally.

They use enormous synchrotrons the size of small villages in order to do this. But people have done it a lot. There's been enormous societal investment because it's really important to understand this, to understand disease, to do drug development. There are now about 200,000 known protein structures, about 140,000 when we did AlphaFold, and we developed a deep learning system that goes from amino acid sequence, DNA sequence, to the structure of a protein in five or 10 minutes instead of a year, and does this with accuracy close to, not quite as good as but very close to, experimental accuracy. And it's been used enormously. So we've

predicted the structure of about 200 million proteins: every protein from an organism whose full genome has been sequenced. Scientists are using it for drug development, to understand the body, everything else. And I think from a machine learning point of view, it's both kind of the first problem really, really transformed by AI. It's an extraordinarily practical system that scientists are using. I think it's something like three million scientists have used our database of predictions. People make predictions every day with this. And it's also this kind of promise that we're going to use AI not just to do things that humans can do or solve human problems, but to do things at a kind of superhuman level. There are no humans that are good at getting the structure of a protein by eye. They do it with experiment. That we can use this to transform science, that we can build new tools that fundamentally advance our science.

>> Now, I remember asking you last year, how did it feel when it first started working?

>> The time I really remember... you know, AlphaFold is built iteratively. It's not that yesterday we didn't have AlphaFold and today we did. It was maybe two years and probably 30 or 40 different individual ideas that worked along the way, some grand ideas, some small ideas, but each one kind of inching up the performance. And I remember maybe a year into building AlphaFold 2, the one that was really very successful, it almost felt too easy. It felt like too many ideas were working. It felt like it was going up. And I remember talking to Tim, the engineering lead, going, "This is really feeling too easy. We're having too much success. This problem can't be this easy. Are we leaking the test set?" Right? You know, are we doing the classic machine learning sin?

>> And he was sitting there going, "I don't think we are." And we went back, we double-checked, we zeroed coordinates in our eval set to make sure we weren't actually leaking. We couldn't really ever find a leak, but it felt too easy. It felt like nature shouldn't yield this easily to our efforts. And I remember I wasn't really totally sure until we actually did some structure predictions for SARS-CoV-2 proteins related to COVID, and then the experiments came out afterwards. Then we were really, really sure: okay, we were really not leaking anything. But it was wild.

>> Wow, that's crazy. But that is also the hallmark of a pro scientist, you know, because during a research project, you miss a thousand balls, and when you finally hit one, you know, you don't ask questions, you celebrate. But that's not what you did. You picked apart the performance immediately instead. So that's amazing.

I mean, you see pro athletes: when they miss, they're always interrogating, fixing, thinking. Like, this is a craft. Machine learning is a craft, and you have to be a craftsperson to do it.

>> Mhm.

All right. Now, I'm asking this because the score didn't jump from zero to 100 in just one magic trick. So this was a sum of many brilliant little puzzle pieces, and each of these contributes a little to the score. You add another puzzle piece, you get another few points. And what I'm wondering is that this sounds like linear progress. You know, you're climbing step by step. So why is it so surprising when you get to the peak?

>> So, you know, much like Moore's law was a succession of ideas and breakthroughs that in total gave the appearance of inevitability, and that inevitability in the case of Moore's law was driven by exponential growth and investment as well, when you do this, you never know if you're going to get the next win, right? In fact, we have charts of progress, and they don't actually go like this, right? The ideas that we list maybe go like that, but the actual progress went flat, flat, flat. Oh, what about this? Idea, idea, idea, idea. Flat, flat, flat, flat. Idea, idea, idea. And in fact, on the flat versus up: at the time, DeepMind was kind of on six-month cycles. So every six months you kind of formally continued your project, and you presented your results to the whole company.

And I remember the first three months we would always try our wildest ideas, and it would mostly not work, and we would get very scared. And then about halfway through we'd be like, "Okay guys, we've got to get serious. We need to not have no progress." And then suddenly some idea would hit, and then a bunch of ideas would hit. So it was always alternating elation and terror. It's only when you make it really blurry and you squint, you zoom out: oh, it went up linearly.

>> Yeah, it's like overnight successes 10 years in the making, right?

>> Yeah, it's that sort of thing.

>> All right. Can you build an intuition on what proteins would look like when folded up into a 3D structure? And also, did you have a protein structure where you looked at the 3D result and said, "That cannot be right," and it turned out to be right? Does that happen?

>> Okay, I'll tell two stories on this. I mean, an intuition: you mean can I build an intuition on an individual protein? So sometimes you can say, "Oh, this looks really similar to this other protein, and therefore I bet it's going to have about the same structure." So that's what humans can do, and people call it homology modeling. It's a very fancy name for saying, well, the sequence is similar, so probably the structure is similar. So you can do that, and sometimes you can notice individual motifs. There were all these papers that would list all these motifs. Helices are a very common element in proteins, and I remember a paper on, well, the last element of a helix is going to be one of these three amino acids, the one before that is going to be one of these, and so on. So there are some regularities and human rules that they've cataloged, and you can kind of use that, but ultimately it only works a little bit and doesn't give you the kind of precision you need to do drug development at all.

In terms of things that surprised us, a real surprise actually came from machine learning. I shouldn't have been surprised, but I was. I mean, there were two big surprises.

One was that sometimes we would have proteins with giant void cavities in the middle, or a protein that was C-shaped. And, you know, the atoms in proteins are really up against each other, right? It's a very dense object. And I said, "It doesn't look right." But the model was extremely confident, and we looked at the experimental structure and then immediately realized what had happened. AlphaFold 2, the original AlphaFold 2, was trained only on single proteins, but often when a protein is solved, multiple copies of itself will appear, what's called a homomer. So maybe three copies densely intertwine with each other, and the actual folded thing is not one copy. It's the three copies together, a trimer. Or there would be some other protein of a completely different type that it would, say, wrap around, and they only appear together in the body. And sometimes AlphaFold would realize these patterns and leave these giant voids that look totally wrong, or this spiral which is just floating in air, and I'd be like, "Well, that's wrong, but it's extraordinarily confident." And then I would find out, oh, it realized that in fact this protein comes in three copies, and so this spiral is one-third of that, and if you overlay it, it's perfect. So even though we didn't tell AlphaFold about this context, it had learned rules that sometimes there are these geometric patterns, which I can explain.

I think the other big surprise was actually when we ran AlphaFold across random proteins in humans. We would see some bits that looked beautiful and structured and some really ugly long arcing ribbons. Oh no, that's wrong.

And I remember we looked at that, and we wouldn't see this very much when we predicted proteins that were experimentally solved. We said, "Oh no, are proteins that have been experimentally solved special, and actually AlphaFold isn't good on the things we hadn't solved?" And then Katherine on the team a little later that day looks in UniProt, this database of various experimental facts about proteins, which will tell you certain regions that are known, for example, experimentally to be disordered. And she starts to realize that where AlphaFold is making these ridiculous long arcing predictions that can't possibly be correct and don't look like proteins, it was very low confidence, and those regions were disordered. And what AlphaFold was in fact telling us, kind of implicitly, is: this region doesn't have a structure. So what we found out is that the lowest AlphaFold confidence was actually pretty much a state-of-the-art predictor of whether a protein was disordered. And so we would find all these things that we kind of knew about proteins but didn't feel, because disorder doesn't appear in this database of protein structures. We would find all these things out just kind of looking at AlphaFold and being surprised.
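The observation above, that low confidence lines up with disordered regions, can be sketched as a simple filter over per-residue confidence scores. This is an illustration under stated assumptions, not the team's method: the 0 to 100 score scale, the cutoff of 50, the minimum run length of 3, and the function name `low_confidence_regions` are all choices made for this example.

```python
def low_confidence_regions(confidence, threshold=50.0, min_length=3):
    """Return (start, end) index pairs (0-based, inclusive) of low-confidence runs.

    A run is a stretch of consecutive residues whose score is below `threshold`;
    runs shorter than `min_length` are ignored as noise.
    """
    regions = []
    start = None  # index where the current low-confidence run began, if any
    for i, score in enumerate(confidence):
        if score < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the chain.
    if start is not None and len(confidence) - start >= min_length:
        regions.append((start, len(confidence) - 1))
    return regions


# Made-up scores for a 14-residue toy chain.
scores = [92, 95, 90, 41, 35, 30, 38, 88, 91, 45, 93, 33, 31, 29]
print(low_confidence_regions(scores))  # [(3, 6), (11, 13)]
```

The single low score at index 9 is skipped because it is shorter than the minimum run length; candidate disordered regions would then be compared against experimental annotations such as those mentioned in the interview.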

>> Mhm. Amazing. Amazing. Now, I'll not ask what its most impactful application is, because it now has hundreds of thousands of research works building on it in just about five years, which is unbelievable. So which one is your favorite?

I think I have two favorites.

One was this giant protein complex, hundreds of protein chains, called the nuclear pore. The nuclear pore is actually kind of the giant gate for the nucleus. The nucleus stores your DNA, right? It's where your nucleic material is, and the rest of the cell is outside the nucleus. And so you need a gatekeeper that decides who can enter and leave, and kind of opens and contracts.

And I remember thinking, you know, this is enormous. It's a thousand times bigger than what AlphaFold can do. So, you know, maybe later we'll come up with some machine learning that will help with these kinds of problems. And then this paper comes out. The first one I saw was out of the Kosinski and Beck labs, saying: we solved the structure of the nuclear pore, of which we knew something like 30% before. Now we know 60 or 70%, and a lot of the rest is actually disordered, because we combined very low resolution experimental techniques, cryo-ET, with AlphaFold for the individual pieces, running different AlphaFold predictions and finding all the little joins and compartments, and then we could finally build the model of the nuclear pore. And in fact that and some very related papers were a special issue of Science, all about the structure of the nuclear pore, and three out of the four made huge use of AlphaFold. I remember searching through these papers and finding maybe 150 mentions of the word AlphaFold, in work that we didn't do. All we did was make the software tool that scientists use to make amazing discoveries. And I just felt like, you know, the Nobel is extraordinary. And now I'm waiting for the Nobel of someone who used AlphaFold and their own creativity to discover the next thing.

>> Yeah. The second-order Nobel is the one that I can't wait for.

>> And I think the other one was that people discovered all these uses of AlphaFold that we didn't expect to really work. So they would run thousands and thousands of AlphaFold predictions and just see which one AlphaFold liked. So the one I really loved: there was a paper on fertilization. How do egg and sperm come together? There are proteins on the egg and there are proteins on the sperm that kind of join together, and they recognize each other and they start fertilization.

But it was known that there was a protein in humans that was missing, that something didn't make sense. And there were actually two labs that did this, that took this protein on the egg and 2,000 proteins, every one that appears on the surface of sperm, and just ran 2,000 AlphaFold predictions, and they found one specific protein that AlphaFold thought stuck up against this egg protein. And then they go to the lab and they knock this protein out, and egg and sperm will come together but not start fertilization. They'll make mutations in the individual regions in which these come together, and they'll find out that blocks fertilization. So they've pretty much established the biochemistry now. They had no idea which of these 2,000 to look at, and AlphaFold said, "Look at just this one," and sure enough, that was the protein that was essential in this. And I love this notion that we would never do this with experiment. You would never send out 2,000 labs to make 2,000 structures and see which one comes back. We can do new types of science because of the scale we've achieved.
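The screen described above (one egg protein against every sperm surface protein, keep the pair the predictor likes best) has the shape of a simple rank-by-score loop. The sketch below is hypothetical throughout: `rank_partners`, the protein names, and the scores are invented, and the scorer is a stand-in lookup table. A real screen would run a structure prediction for each pair and read off some interface-confidence measure, which the interview does not specify.

```python
from typing import Callable, Iterable


def rank_partners(
    bait: str,
    candidates: Iterable[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Score every bait/candidate pair and return the top_k highest-scoring candidates."""
    scored = [(name, score_pair(bait, name)) for name in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Stand-in scorer with made-up numbers, for illustration only.
FAKE_SCORES = {"sperm_A": 0.12, "sperm_B": 0.81, "sperm_C": 0.20, "sperm_D": 0.09}


def fake_score_pair(bait: str, candidate: str) -> float:
    return FAKE_SCORES[candidate]


hits = rank_partners("egg_protein", FAKE_SCORES, fake_score_pair, top_k=2)
print(hits)  # [('sperm_B', 0.81), ('sperm_C', 0.2)]
```

The ranking only nominates candidates; as in the story, the top hit still has to be confirmed at the bench with knockouts and mutations.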

>> Yeah.

Incredible. Any unexpected use cases?

>> All right. So one that really surprised me... I can tell you an unexpected weakness and then an unexpected strength of AlphaFold. So the unexpected weakness is if you take a protein and you break it, you do something that's going to cause it to be unstable. Like, one very strong rule of proteins is that positively or negatively charged amino acids don't appear in the greasy middle part of a protein, right? They don't like grease.

And so aspartic acid is a very small charged amino acid. It doesn't really appear in the center of proteins. So if you take a protein and you mutate one of the inner amino acids to aspartic acid, AlphaFold won't really change its structure. Even though this doesn't make sense, and there are reasons you can explain it, we say AlphaFold is not extremely point-mutation sensitive. It's answering a slightly different question. So we said, okay, that's some future work.

And so there are a lot of people who do protein design and were using AlphaFold to check their designs and ask: does their design method work? Does it produce sequences that AlphaFold thinks fold into the structure they were trying to make? And I remember thinking that's probably not going to work, because AlphaFold isn't mutation sensitive. It doesn't have a sensitive enough understanding of the interactions. But I was totally wrong about that. And people found that it was actually really good, when it came to designed proteins, at figuring out which ones might work. One paper that came out a few months after AlphaFold said that when designing proteins to bind to each other, they get a 10-fold increase in success rate if they only make the things that AlphaFold thinks bind.

>> Mhm.

>> And it's become really dominant, actually: AlphaFold filtering is one of the secrets of modern protein design. Even though we were designing a natural protein system, we got kind of this enormous design improvement for free.

>> Mhm. Now, just to showcase the influence of AlphaFold, in my opinion, let me hold on to my papers for this one to make sure I word this properly.

>> Oh, yeah.

>> In 20 years, nearly every person with access to modern healthcare will benefit from a tool, diagnostic, or drug influenced by AlphaFold. What do you think?

>> I think that's pretty fair. I think that it is now a tool of modern biology. And I will say that there are other tools: every biological discovery today in some way benefits from DNA sequencing, right? DNA synthesis, right? These are tools that underpin the kind of technology of modern biology. And AlphaFold is very certainly one of those. People teach it to grad students, right? It's a standard part of the graduate curriculum: we will learn how to do some things, and I will show you how to use

AlphaFold, because you will probably use it in your research. And then people make all these discoveries, and these discoveries compound and grow. That's the wonderful part of working in research: you have this enormous spreading out of the work you do. I think sometimes it's wonderful to be a doctor, to be someone who very definitely and obviously decides the right treatment for a patient and makes someone healthy. But I also love the thought of being a researcher, that I can build a tool that will help a hundred thousand people, that will help a million, that will help a billion be healthy in the fullness of time, as it helps bring forward science. You know, I like to think that AlphaFold maybe made structural biology, which is one of the major fields of biology, five or 10% faster.

>> Right? And that's extraordinary.

Is it a possibility... you know, AlphaFold, it gives you a confidence score too, not just a prediction. Can it be confidently incorrect?

>> Yes. So, a very simple analogy. If the weather report says there's a 90% chance of rain today and it doesn't rain, was it wrong? Some people will say yes, but that's not obviously correct. You're supposed to be wrong one time in 10. So we can say AlphaFold's confidence is calibrated. That's what we can really say: you know, 90% chance of being... or, you know, really what we say is that average accuracy is 0.9 on a certain scale called LDDT. So if our confidence says 0.9, then on average it will be there, but some of them will be very bad. And actually we know a very interesting failure mode of very high confidence. Sometimes it's just wrong. But more commonly, for example, a protein will have two structures, and AlphaFold will produce one with high confidence, but you really wanted the other one. And so confidence more reflects "does this structure make sense as one state of the protein," but it doesn't necessarily say it's every state of the protein or the one you care about.
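The weather-forecast analogy above is a statement about calibration: among predictions made with confidence 0.9, the average accuracy should be about 0.9, even though individual cases can be much worse. A minimal way to check that is to bin predictions by confidence and compare the mean confidence with the mean accuracy in each bin. The numbers below are made up, and `calibration_table` is a name invented for this sketch; it assumes both confidence and accuracy are on a 0 to 1 scale.

```python
from collections import defaultdict


def calibration_table(confidences, accuracies, n_bins=5):
    """Group predictions into equal-width confidence bins.

    Returns {bin_index: (mean_confidence, mean_accuracy, count)}.
    """
    bins = defaultdict(list)
    for conf, acc in zip(confidences, accuracies):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, acc))
    table = {}
    for index in sorted(bins):
        pairs = bins[index]
        mean_conf = sum(c for c, _ in pairs) / len(pairs)
        mean_acc = sum(a for _, a in pairs) / len(pairs)
        table[index] = (mean_conf, mean_acc, len(pairs))
    return table


# Made-up example: four predictions at confidence 0.9, two at 0.5.
confidences = [0.9, 0.9, 0.9, 0.9, 0.5, 0.5]
accuracies = [1.0, 1.0, 1.0, 0.6, 0.4, 0.6]

for index, (mean_conf, mean_acc, count) in calibration_table(confidences, accuracies).items():
    print(f"bin {index}: confidence {mean_conf:.2f}, accuracy {mean_acc:.2f}, n={count}")
```

In this toy data the 0.9 bin averages 0.9 accuracy while containing one prediction that scored only 0.6, which is the point being made: calibrated on average does not mean every high-confidence prediction is right.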

>> All right, let's have a lightning round.

I ask you something and try to answer in one sentence.

>> Oh, that's hard for me.

How did AlphaFold 2 improve on the first one?

>> We did machine learning research at the intersection of proteins and ML, not taking ML off the shelf and applying it to proteins.

>> AlphaFold 3?

>> We expanded it to do the protein cinematic universe, and we adjusted the architecture to make it work.

AlphaProteo?

>> It developed new techniques to design more efficiently using AlphaFold and other ideas.

>> Favorite Two Minute Papers episode?

>> Oh, AlphaFold. It's easy.

>> Kidding. Yes. All right, John. I've learned so much again. Huge honor. Thank you so much.

>> It was a pleasure. Thank you.
