AlphaFold-Metainference

Context

The application of deep learning methods to the protein folding problem has transformed our ability to generate accurate models of the native states of proteins from the knowledge of their amino acid sequences. The initial predictions of the native structures of proteins have also been recently extended to protein complexes. These advances have prompted the question of whether it is possible to use this type of approach for the prediction of the conformational fluctuations of the native states of folded proteins, and more generally for the characterisation of the structural properties of the native states of disordered proteins. Support for this idea comes from the observation that AlphaFold performs as well as current state-of-the-art predictors of protein disorder. It has also been reported that the predicted aligned error (PAE) maps from AlphaFold are correlated with the distance variation matrices from molecular dynamics simulations, suggesting that AlphaFold provides information about the dynamics of proteins in addition to their structures.

Since the native states of disordered proteins can be represented as ensembles of conformations with statistical weights obeying the Boltzmann distribution, a relevant goal is to extend AlphaFold to predict structural ensembles of partially disordered proteins. In this work, we propose an approach to this task based on the observation that AlphaFold can predict inter-residue distances even for disordered proteins, despite having been trained on folded proteins. This distance information is then combined with molecular dynamics within the Metainference framework.

Theory

More information about AlphaFold-Metainference can be found here

AlphaFold distance map prediction. Average inter-residue distances of disordered proteins can be predicted through the distogram head of AlphaFold using this colab notebook. These distances are defined as those between the β carbon atom positions for all amino acids except glycine, for which the α carbon atom positions are used instead. The multiple sequence alignment (MSA) is generated with MMseqs2 (default setting) on BFD/MGnify and Uniclust30. Model 1.1.1 of AlphaFold (default setting) is used for the predictions, with no structural templates. AlphaFold discretises the distribution of inter-residue distances into 64 bins of equal width, covering the range from 2.15625 to 21.84375 Å, with the last bin also including distances longer than 21.84375 Å. For each pair of residues ($i$ and $j$), AlphaFold predicts the probability $p_{ij}^{b}$ that their distance is within bin $b$. The predicted distance ${\widehat{d}}_{ij}$ between residues $i$ and $j$ is estimated as the mean of the predicted distribution,

${\widehat{d}}_{ij}=\sum_{b=1}^{64} d^{b} p_{ij}^{b}$

where $d^{b}$ represents the central value of bin $b$.
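The expectation above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the AlphaFold repository: the function name `expected_distances` and the `(N, N, 64)` array layout are hypothetical, and the sketch assumes that the quoted limits 2.15625 and 21.84375 Å are the central values of the first and last bins.

```python
import numpy as np

def expected_distances(probs):
    """Mean inter-residue distance from distogram probabilities.

    probs: array of shape (N, N, 64) holding p_ij^b for each residue pair.
    Returns an (N, N) array with d_ij = sum_b d^b * p_ij^b, in Angstrom.
    """
    probs = np.asarray(probs, dtype=float)
    # Assumption: 64 equally spaced bin centres d^b between the quoted limits.
    centers = np.linspace(2.15625, 21.84375, 64)
    # Renormalise so that each pair's probabilities sum to one.
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ centers
```

Because the last bin also collects all distances beyond 21.84375 Å, the mean computed this way is bounded from above by that value.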

Metainference. Metainference is a Bayesian inference method that enables the determination of structural ensembles by combining prior information and experimental data according to the maximum entropy principle. In AlphaFold-Metainference (AF-MI), Metainference is implemented by using the distance matrix $\mathbf{d}^{AF}$ predicted by AlphaFold as pseudo-experimental data. By design, Metainference can disentangle structural heterogeneity from systematic errors, such as force field or forward model inaccuracies, random errors in the data, and errors due to the limited sample size of the ensemble. The molecular simulations are carried out according to the Metainference energy function, $E = - k_{B}T\log\left( p_{MI} \right)$, where $k_{B}$ is the Boltzmann constant, $T$ is the temperature, and $p_{MI}$ is the Metainference posterior probability, which is compatible with the maximum entropy principle,

$$p_{MI}\left( \mathbf{X},\sigma^{SEM},\sigma^{B}\ |\ \mathbf{d}^{AF} \right) = \prod_{r = 1}^{N_{R}}{p\left( X_{r} \right)\prod_{i = 1}^{N_{D}}{p\left( \mathbf{d}^{AF}\ |\ \mathbf{X},\sigma_{i}^{SEM},\sigma_{r,i}^{B} \right)p\left( \sigma_{r,i}^{B} \right)}}\ \ \ (3)$$

In this formula, $\mathbf{X}$ denotes the vector comprising the atomic coordinates of the structural ensemble, consisting of individual replicas $X_r$ ($N_R$ in total), $\sigma^{SEM}$ the error associated with the limited number of replicas in the ensemble, $\sigma^{B}$ the random and systematic errors in the prior molecular dynamics force field as well as in the forward model and the data, and $\mathbf{d}^{AF}$ the AF distance matrix. Note that $\sigma^{SEM}$ is calculated for each data point ($\sigma_i^{SEM}$), while $\sigma^{B}$ is computed for each data point $i$ and replica $r$ as $\sigma_{r,i}^{B}$. The functional form of the likelihood $p(\mathbf{d}^{AF}|\mathbf{X},\sigma_i^{SEM},\sigma_{r,i}^{B})$ is a Gaussian function

$$p(\mathbf{d}^{AF}\ |\ \mathbf{X},\sigma_{i}^{SEM},\sigma_{r,i}^{B}) = \frac{1}{\sqrt{2\pi}\ \sqrt{\left( \sigma_{r,i}^{B} \right)^{2} + \left( \sigma_{i}^{SEM} \right)^{2}}}\exp\left\lbrack - \frac{1}{2}\frac{\left\lbrack d_{ij}^{AF} - d_{ij}\left( \mathbf{X} \right) \right\rbrack^{2}}{\left( \sigma_{r,i}^{B} \right)^{2} + \left( \sigma_{i}^{SEM} \right)^{2}} \right\rbrack\ \ \ (4)$$

where $d_{ij}(\mathbf{X})$ represents the forward model for data point $i,j$, namely the distance between residues $i$ and $j$ calculated in the ensemble. For multiple replicas, the Metainference energy function is

$$E_{MI}\left( \mathbf{X},\sigma \right) = E_{MD}\left( \mathbf{X} \right) + \frac{k_{B}T}{2}\sum_{r,i}^{N_{R},N_{D}}\frac{\left\lbrack d_{i} - f_{i}\left( X_{r} \right) \right\rbrack^{2}}{\left( \sigma_{r,i}^{B} \right)^{2} + \left( \sigma_{i}^{SEM} \right)^{2}} + E_{\sigma}\ \ \ (5)$$

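where $d_{i}$ and $f_{i}$ denote the AlphaFold-predicted distance and the forward model for data point $i$, and $E_{\sigma}$ corresponds to the energy term associated with the errors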

$$E_{\sigma} = k_{B}T\sum_{r,i}^{N_{R},N_{D}}\left\lbrace - \log p\left( \sigma_{r,i}^{B} \right) + \frac{1}{2}\log\left\lbrack \left( \sigma_{r,i}^{B} \right)^{2} + \left( \sigma_{i}^{SEM} \right)^{2} \right\rbrack \right\rbrace\ \ \ (6)$$

Finally, $E_{MD}$ corresponds to the potential energy function of the molecular dynamics force field, which in this case is the CALVADOS 2 force field. The space of conformations $X_r$ is sampled through multi-replica simulations (this tutorial uses six replicas, although the larger this number, the better), while the error parameters for each data point, $\sigma_{r,i}^{B}$, are sampled through a Gibbs sampling scheme at each time step. In this tutorial, the range of the error sampling is set to [0.0001, 10] and the associated trial move of the Gibbs sampling is set to 0.1. For more information on how to select these parameters, refer here. The error parameter due to the limited number of replicas used to estimate the forward model ($\sigma^{SEM}$) is calculated on the fly by window averaging every 200 steps of molecular dynamics.
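To make the role of the error parameters more concrete, the following NumPy sketch shows a replica-averaged distance forward model and a single trial move on $\sigma_{r,i}^{B}$ for one data point. It is an illustration under stated assumptions rather than the implementation used in the tutorial: all function names are hypothetical, energies are expressed in units of $k_{B}T$, the prior on $\sigma^{B}$ is assumed to be a Jeffreys prior $p(\sigma) = 1/\sigma$, and a simple Metropolis acceptance rule stands in for the sampling scheme described above. Only the range [0.0001, 10] and the step size 0.1 are taken from the text.

```python
import numpy as np

def forward_model(coords):
    """Replica-averaged distance matrix d_ij(X).

    coords: array of shape (N_R, N, 3), one set of bead positions per replica.
    """
    diff = coords[:, :, None, :] - coords[:, None, :, :]
    return np.linalg.norm(diff, axis=-1).mean(axis=0)

def sigma_energy(dev2, sigma_b, sigma_sem):
    """Sigma-dependent energy of one data point, in units of k_B T.

    dev2 is the squared deviation between the AlphaFold distance and the
    forward model. The last term is -log p(sigma_b) for the ASSUMED
    Jeffreys prior p(sigma) = 1/sigma.
    """
    s2 = sigma_b**2 + sigma_sem**2
    return 0.5 * dev2 / s2 + 0.5 * np.log(s2) + np.log(sigma_b)

def sample_sigma_b(dev2, sigma_b, sigma_sem, rng,
                   sigma_min=1e-4, sigma_max=10.0, dsigma=0.1):
    """One Metropolis trial move on sigma_b (illustrative stand-in)."""
    trial = sigma_b + dsigma * rng.uniform(-1.0, 1.0)
    if not (sigma_min <= trial <= sigma_max):
        return sigma_b  # reject moves that leave the allowed range
    delta = (sigma_energy(dev2, trial, sigma_sem)
             - sigma_energy(dev2, sigma_b, sigma_sem))
    if delta <= 0.0 or rng.random() < np.exp(-delta):
        return trial
    return sigma_b
```

In a simulation, a move of this kind would be attempted for every data point and replica at each time step, with `rng` created once, for example as `np.random.default_rng()`.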

AF-MI advantages

The advantages of combining AlphaFold inter-residue distance data with the CALVADOS 2 force field to retrieve structural ensembles of proteins are:

  • The high convergence speed, due to the coarse-grained model and a metadynamics bias
  • The increased accuracy in modeling partially ordered protein interactions, which relies on the introduction of AlphaFold-based distance restraints