AlphaFold 2 explained: A semi-deep dive

Late last month, DeepMind, Google's machine learning research arm known for creating bots that beat world champions at Go and StarCraft II, hit a new milestone: accurately predicting protein structure. If the results are as good as the team claims, their model, AlphaFold, could be a boon for both drug discovery and basic biological research. But how does this new neural-network-based model work? In this post, I'll try to give you a brief but semi-deep dive into both the machine learning and the biology that power this model.

First, a quick biology primer: the functions of proteins in the body are determined entirely by their three-dimensional structure. For example, it's the infamous "spike proteins" studding the coronavirus that allow the virus to enter our cells. Meanwhile, mRNA vaccines such as Moderna's and Pfizer's mimic the shape of those spike proteins, causing the body to mount an immune response. But historically, determining protein structure (using experimental techniques such as X-ray crystallography, nuclear magnetic resonance, and cryo-electron microscopy) has been difficult, slow, and expensive. And for some types of proteins, these techniques don't work at all.

In theory, though, a protein's 3D shape should be completely determined by the string of amino acids that make it up. And we can easily determine a protein's amino acid sequence by DNA sequencing (remember from Bio 101 how your DNA codes for amino acid sequences?). In practice, however, predicting protein structure from an amino acid sequence has been a hair-pulling challenge that we've been trying to solve for decades.

This is where AlphaFold comes in. It's a neural-network-based algorithm that has performed astonishingly well on the protein folding problem, so much so that it seems to rival the quality of the traditional slow and expensive imaging methods.

Sadly for nerds like me, we can't know exactly what AlphaFold does, because the official paper has yet to be published and peer-reviewed. Until then, all we have to go on is the company's blog post. But since AlphaFold 2 is actually an iteration on a slightly older model (AlphaFold 1) published last year, we can make some pretty good guesses. In this post, I'll focus on two main pieces: the basic neural architecture of AlphaFold 2 and how it makes effective use of unlabeled data.

First, this new development isn't all that different from a similar AI development I wrote about a few months ago, GPT-3. GPT-3 was a large language model created by OpenAI that could write impressively human-like poems, sonnets, jokes, and even code samples. What made GPT-3 so powerful was that it was trained on a very large dataset, and was based on a type of neural network called a “transformer”.

Transformers, invented in 2017, seem to be a truly magical machine learning hammer that smashes open problems in every domain. In an intro machine learning class, you'll often learn to use different model architectures for different types of data: convolutional neural networks for analyzing images, recurrent neural networks for analyzing text. Transformers were originally invented to do machine translation, but they appear to be far more widely effective, able to understand text, images, and now proteins. So one major difference between AlphaFold 1 and AlphaFold 2 is that the former used convolutional neural networks (CNNs) while the newer version uses transformers.
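DeepMind hasn't published the exact transformer blocks AlphaFold 2 uses, but the core operation shared by essentially all transformers, scaled dot-product attention, is simple enough to sketch. Here's a minimal, illustrative version in numpy; the shapes and toy input are made up for this post, not taken from AlphaFold:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core of a transformer layer: every position attends to every other
    position, weighted by how well its query matches the other keys."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarity
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # softmax over positions
    return weights @ V                              # weighted mix of values

# Toy "sequence" of 4 positions, each an 8-dim vector (standing in for
# word or amino-acid representations).
x = np.random.randn(4, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8)
```

The key property is that attention compares every position against every other position directly, whereas a CNN only mixes information within a local window at each layer.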

Now let's talk about the data used to train AlphaFold. According to the blog post, the model was trained on a public dataset of 170,000 proteins with known structures, plus a huge database of protein sequences with unknown structures. The public dataset of known proteins serves as the model's labeled training data, its ground truth. Size is relative, but in my experience, 170,000 "labeled" examples is a pretty small training dataset for such a complex problem. That suggests the authors must have done a good job of exploiting the "unlabeled" dataset of proteins with unknown structures.

But what good is a dataset of protein sequences whose shapes are a mystery? It turns out that learning from unlabeled data, so-called "unsupervised learning," has enabled many recent AI breakthroughs. GPT-3, for example, was trained on a large corpus of unlabeled text data scraped from the web. Given a fragment of a sentence, it had to predict which words came next, a task called "next-word prediction" that forced it to learn something about the underlying structure of language. The same technique has also been adapted for images: cut an image in half, and ask a model to predict what the bottom of the image should look like from the top:

Photo from https://openai.com/blog/image-gpt/
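To make the next-word prediction objective concrete, here's a tiny sketch of how training pairs get carved out of raw, unlabeled text (the sentence below is made up for illustration):

```python
# Unlabeled text becomes (context, next word) training pairs "for free":
text = "the spike protein lets the virus enter our cells".split()

pairs = [(text[:i], text[i]) for i in range(1, len(text))]
for context, target in pairs[:3]:
    print(context, "->", target)
# ['the'] -> spike
# ['the', 'spike'] -> protein
# ['the', 'spike', 'protein'] -> lets
```

No human labeling is needed: the "label" for each example is just the word that actually came next in the source text.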

The idea is that, if you don't have enough data to train a model on the task you actually care about, train it first on a related task for which you do have enough data, one that forces it to learn something about the underlying structure of language, or images, or proteins. Then you can fine-tune it on the task you really wanted to do.
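In practice, that "pre-train, then fine-tune" recipe often looks something like the PyTorch sketch below. Everything here is hypothetical and purely illustrative: a frozen pre-trained encoder plus a small task-specific head trained on the scarce labeled data.

```python
import torch
import torch.nn as nn

# Hypothetical pre-trained encoder (stand-in for a model trained on lots of
# unlabeled data); in a real pipeline you would load saved weights here.
pretrained_encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU())

for p in pretrained_encoder.parameters():
    p.requires_grad = False          # freeze the pre-trained knowledge

head = nn.Linear(256, 10)            # small task-specific layer to fine-tune

x = torch.randn(4, 128)              # toy batch of input features
predictions = head(pretrained_encoder(x))
print(predictions.shape)             # torch.Size([4, 10])
```

Only the head (and sometimes the top few encoder layers) gets updated on the small labeled dataset, so the model keeps what it learned during pre-training.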

A very popular way to do this is via embeddings. An embedding is a way of mapping data into vectors whose positions in space carry meaning. A famous example is Word2Vec: it's a tool for taking a word (e.g., "hammer") and mapping it into N-dimensional space so that similar words ("screwdriver," "nail") are mapped nearby. And, like GPT-3, it was trained on an unlabeled text dataset.
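Here's a minimal sketch of that idea using the gensim library (assuming gensim is installed). The corpus is a few made-up "sentences"; a real Word2Vec model is trained on billions of words, but the mechanics are the same:

```python
from gensim.models import Word2Vec

# Tiny, made-up corpus of tokenized sentences.
sentences = [
    ["hammer", "nail", "wood"],
    ["screwdriver", "screw", "wood"],
    ["hammer", "screwdriver", "toolbox"],
]

# Learn a small embedding for each word from the unlabeled corpus.
model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, epochs=50)

# Words used in similar contexts end up with nearby vectors.
print(model.wv.most_similar("hammer", topn=2))
```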

So what's the Word2Vec equivalent for molecular biology? How do we squeeze knowledge out of amino acid chains with unknown, unlabeled structures? One technique is to look at clusters of proteins with similar amino acid sequences. Often, one protein's sequence is similar to another's because the two share an evolutionary origin. The more similar those amino acid sequences are, the more likely it is that the proteins serve a similar purpose for the organisms they're made in, which means, in turn, that they're more likely to have similar structures.

So the first step is figuring out how similar two amino acid sequences are. To do that, biologists typically compute something called an MSA, or multiple sequence alignment. One amino acid sequence might be very similar to another, but with some extra or "inserted" amino acids that make it longer than the other. An MSA is a way of inserting gaps so that the sequences line up with each other as closely as possible.

Multiple sequence alignment

Image of an MSA. Modi, V., Dunbrack, R.L. A Structurally-Validated Multiple Sequence Alignment of 497 Human Protein Kinase Domains. Sci Rep 9, 19790 (2019).
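The figure above shows a real MSA across hundreds of sequences. Here's a much smaller sketch of the underlying idea, a pairwise alignment computed with Biopython (assuming Biopython is installed; the two amino acid sequences are made up, with the second missing one residue):

```python
from Bio import Align

seq_a = "MKTAYIAKQR"   # hypothetical sequence
seq_b = "MKTAYIAQR"    # same sequence with one residue "deleted"

aligner = Align.PairwiseAligner()
aligner.mode = "global"

alignment = aligner.align(seq_a, seq_b)[0]
print(alignment)        # prints the two sequences with a gap inserted
print(alignment.score)  # how many positions line up after alignment
```

An MSA does the same thing across many sequences at once, so that each column of the alignment corresponds to "the same" position in all of the related proteins.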

As outlined in DeepMind's blog post, MSA appears to be an important initial step in the model.

AlphaFold architecture

Diagram from the AlphaFold blog post.

You can also see from this diagram that DeepMind is computing an MSA embedding, and this is where they take advantage of all that unlabeled data. For this bit, I had to call in a favor from my Harvard biologist friend. It turns out that in sets of similar (but not identical) proteins, the ways the amino acid sequences vary are often correlated. For example, a mutation at the 13th amino acid might often be accompanied by a mutation at the 27th. Amino acids that are far apart in a sequence shouldn't generally have much effect on each other, unless they sit close together in 3D space when the protein is folded, which is a valuable hint about the protein's overall shape. So even though we don't know the shapes of the sequences in this unlabeled dataset, those correlated variations are informative. Neural networks can learn from patterns like these, distilling them into embeddings, and that seems to be what AlphaFold 2 is doing.
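To make "correlated variations" concrete, here's a toy sketch of the classic way to measure co-variation between two MSA columns: mutual information. The mini-MSA and the column pair are invented for illustration, and this is not DeepMind's actual method, just the standard intuition their embedding builds on:

```python
import numpy as np

# Toy MSA: each row is one aligned (hypothetical) protein sequence.
msa = np.array([
    list("MKTAYIAK"),
    list("MRTAYLAK"),
    list("MKTAYIAR"),
    list("MRTAYLAR"),
])

def column_mutual_information(msa, i, j):
    """Mutual information between alignment columns i and j:
    a high value suggests the two positions tend to mutate together."""
    col_i, col_j = msa[:, i], msa[:, j]
    mi = 0.0
    for a in set(col_i):
        for b in set(col_j):
            p_ab = np.mean((col_i == a) & (col_j == b))
            p_a = np.mean(col_i == a)
            p_b = np.mean(col_j == b)
            if p_ab > 0:
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi

# Columns 1 and 5 co-vary in this toy MSA (K<->R always pairs with I<->L),
# so their mutual information is high, hinting they may be in contact in 3D.
print(column_mutual_information(msa, 1, 5))
```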

And that, in a nutshell, is a primer on some of the machine learning and biology behind AlphaFold 2. Of course, we'll have to wait until the paper is published to know the full story. Here's hoping it's really as powerful as we think.
