diff --git a/README.md b/README.md
index 0604852..7265514 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # Evolutionary model of Variant Effects (EVE)
 
-This repository contains the code to create EVE models as per our paper "Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning" (https://www.biorxiv.org/content/10.1101/2020.12.21.423785v1).
+This is the official code repository for the paper "Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning" (https://www.biorxiv.org/content/10.1101/2020.12.21.423785v1). This project is a collaboration between the Marks lab (https://www.deboramarkslab.com/) and the OATML group (https://oatml.cs.ox.ac.uk/).
 
 ## Overview
 EVE is a set of protein-specific models providing for any single amino acid mutation of interest a score reflecting the propensity of the resulting protein to be pathogenic. For each protein family, a Bayesian VAE learns a distribution over amino acid sequences from evolutionary data. It enables the computation of an evolutionary index for each mutant, which approximates the log-likelihood ratio of the mutant vs the wild type. A global-local mixture of Gaussian Mixture Models separates variants into benign and pathogenic clusters based on that index. The EVE scores reflect probabilistic assignments to the pathogenic cluster.
@@ -17,8 +17,8 @@ The "examples" folder contains sample bash scripts to obtain EVE scores for the
 The corresponding MSA and ClinVar labels are provided in the data folder.
 
 ## Data requirements
-The only data required to train EVE models and obtain scores from scratch are the multiple sequence alignments for the corresponding proteins. 
-The third script (train_GMM_and_compute_EVE_scores.py) provides functionalities to compare EVE scores with reference labels (e.g., ClinVar) to be provided by the user.
+The only data required to train EVE models and obtain EVE scores from scratch are the multiple sequence alignments (MSAs) for the corresponding proteins (see data/MSA for an example MSA for PTEN). The code provides basic functionality to pre-process MSAs for modelling. By default, sequences with 50% or more gaps in the alignment and/or positions with less than 70% residue occupancy are removed. These parameters may be adjusted as needed (see utils/data_utils.py for more details).
+The script "train_GMM_and_compute_EVE_scores.py" provides functionality to compare EVE scores with reference labels (e.g., ClinVar). These labels must be provided by the user, in a format similar to the example under data/labels.
 
 ## Software requirements
 The entire codebase is written in python. Package requirements are as follows:
@@ -49,4 +49,4 @@ Large-scale clinical interpretation of genetic variants using evolutionary data
 Jonathan Frazer, Pascal Notin, Mafalda Dias, Aidan Gomez, Kelly Brock, Yarin Gal, Debora S. Marks
 bioRxiv 2020.12.21.423785
 doi: https://doi.org/10.1101/2020.12.21.423785
-```
\ No newline at end of file
+```