Added details about MSA pre-processing
pascalnotin authored May 6, 2021
1 parent e8d0399 commit efa412b
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -1,6 +1,6 @@
# Evolutionary model of Variant Effects (EVE)

-This repository contains the code to create EVE models as per our paper "Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning" (https://www.biorxiv.org/content/10.1101/2020.12.21.423785v1).
+This is the official code repository for the paper "Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning" (https://www.biorxiv.org/content/10.1101/2020.12.21.423785v1). This project is a joint collaboration between the Marks lab (https://www.deboramarkslab.com/) and the OATML group (https://oatml.cs.ox.ac.uk/).

## Overview
EVE is a set of protein-specific models providing for any single amino acid mutation of interest a score reflecting the propensity of the resulting protein to be pathogenic. For each protein family, a Bayesian VAE learns a distribution over amino acid sequences from evolutionary data. It enables the computation of an evolutionary index for each mutant, which approximates the log-likelihood ratio of the mutant vs the wild type. A global-local mixture of Gaussian Mixture Models separates variants into benign and pathogenic clusters based on that index. The EVE scores reflect probabilistic assignments to the pathogenic cluster.
@@ -17,8 +17,8 @@ The "examples" folder contains sample bash scripts to obtain EVE scores for the
The corresponding MSA and ClinVar labels are provided in the data folder.

## Data requirements
-The only data required to train EVE models and obtain scores from scratch are the multiple sequence alignments for the corresponding proteins.
-The third script (train_GMM_and_compute_EVE_scores.py) provides functionalities to compare EVE scores with reference labels (e.g., ClinVar) to be provided by the user.
+The only data required to train EVE models and obtain EVE scores from scratch are the multiple sequence alignments (MSAs) for the corresponding proteins (see data/MSA for an example MSA for PTEN). The code provides basic functionalities to pre-process MSAs for modelling. By default, sequences with 50% or more gaps in the alignment and/or positions with less than 70% residue occupancy will be removed. These parameters may be adjusted as needed by the end user (see utils/data_utils.py for more details).
+The script "train_GMM_and_compute_EVE_scores.py" provides functionalities to compare EVE scores with reference labels (e.g., ClinVar) -- these labels are to be provided by the user (using a format similar to the example provided under data/labels).
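
The default MSA filtering described in the added lines can be illustrated with a minimal sketch. This is not the code in utils/data_utils.py: the names `gap_fraction` and `filter_msa`, the gap characters, and the order of the two steps (sequences first, then positions, with occupancy measured over the retained sequences) are all assumptions made for illustration. Only the two thresholds come from the README text.

```python
# Illustrative sketch only; function names and details are hypothetical,
# not taken from utils/data_utils.py.
GAP_CHARS = {"-", "."}  # assumption: both characters denote alignment gaps


def gap_fraction(seq):
    """Fraction of positions in an aligned sequence that are gaps."""
    if not seq:
        return 1.0  # treat an empty sequence as all gaps
    return sum(ch in GAP_CHARS for ch in seq) / len(seq)


def filter_msa(msa, max_seq_gap_frac=0.5, min_col_occupancy=0.7):
    """Filter an alignment given as {name: aligned_sequence}.

    Step 1 drops sequences whose gap fraction is 50% or more.
    Step 2 drops positions with less than 70% residue occupancy,
    measured over the sequences kept in step 1.
    Returns the filtered alignment and the indices of the kept positions.
    """
    kept = {
        name: seq
        for name, seq in msa.items()
        if gap_fraction(seq) < max_seq_gap_frac
    }
    if not kept:
        return {}, []

    n_seqs = len(kept)
    length = len(next(iter(kept.values())))
    kept_cols = []
    for i in range(length):
        occupied = sum(seq[i] not in GAP_CHARS for seq in kept.values())
        if occupied / n_seqs >= min_col_occupancy:
            kept_cols.append(i)

    filtered = {
        name: "".join(seq[i] for i in kept_cols)
        for name, seq in kept.items()
    }
    return filtered, kept_cols


# Toy alignment (made-up sequences, not real data).
toy_msa = {"wt": "MKTA", "s1": "MK-A", "s2": "M--A", "s3": "MR-A"}
filtered, kept_cols = filter_msa(toy_msa)
print(kept_cols)  # [0, 1, 3]
print(filtered)   # {'wt': 'MKA', 's1': 'MKA', 's3': 'MRA'}
```

In the toy alignment, `s2` is dropped because exactly half of its positions are gaps, and the third position is dropped because only one of the three remaining sequences has a residue there. Passing different values for `max_seq_gap_frac` or `min_col_occupancy` mirrors the README's note that these parameters can be adjusted.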

## Software requirements
The entire codebase is written in python. Package requirements are as follows:
@@ -49,4 +49,4 @@ Large-scale clinical interpretation of genetic variants using evolutionary data
Jonathan Frazer, Pascal Notin, Mafalda Dias, Aidan Gomez, Kelly Brock, Yarin Gal, Debora S. Marks
bioRxiv 2020.12.21.423785
doi: https://doi.org/10.1101/2020.12.21.423785
-```
+```
