Added new examples and updated readme

OATML-Markslab · Aug 27, 2021 · aa9589e
1 parent efa412b
commit aa9589e
Showing 8 changed files with 674,004 additions and 6 deletions.
diff --git a/EVE/VAE_model.py b/EVE/VAE_model.py
@@ -316,8 +316,9 @@ def compute_evol_indices(self, msa_data, list_mutations_location, num_samples, b
         for i,mutation in enumerate(list_valid_mutations):
             sequence = list_valid_mutated_sequences[mutation]
             for j,letter in enumerate(sequence):
-                k = msa_data.aa_dict[letter]
-                mutated_sequences_one_hot[i,j,k] = 1.0
+                if letter in msa_data.aa_dict:
+                    k = msa_data.aa_dict[letter]
+                    mutated_sequences_one_hot[i,j,k] = 1.0
 
         mutated_sequences_one_hot = torch.tensor(mutated_sequences_one_hot)
         dataloader = torch.utils.data.DataLoader(mutated_sequences_one_hot, batch_size=batch_size, shuffle=False, num_workers=4, pin_memory=True)
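
The guard added in this hunk means a letter missing from `msa_data.aa_dict` no longer raises a `KeyError`; that position simply keeps an all-zero one-hot row. Below is a minimal standalone sketch of that behaviour, not the repository's code. It assumes a hypothetical 20-letter alphabet (`ALPHABET`) and a hypothetical helper name (`one_hot_encode`), and uses NumPy only.

```python
import numpy as np

# Hypothetical alphabet for illustration; the real mapping is msa_data.aa_dict.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
aa_dict = {letter: index for index, letter in enumerate(ALPHABET)}

def one_hot_encode(sequences, aa_dict):
    """One-hot encode equal-length sequences.

    Letters absent from aa_dict (gaps, ambiguity codes) are skipped,
    so their position stays an all-zero vector.
    """
    seq_len = len(sequences[0])
    one_hot = np.zeros((len(sequences), seq_len, len(aa_dict)), dtype=np.float32)
    for i, sequence in enumerate(sequences):
        for j, letter in enumerate(sequence):
            if letter in aa_dict:  # same guard as in the hunk above
                one_hot[i, j, aa_dict[letter]] = 1.0
    return one_hot

# "-" and "X" are not in the alphabet and are silently skipped.
encoded = one_hot_encode(["ACD", "A-X"], aa_dict)
```

One consequence worth keeping in mind: skipped positions are indistinguishable from "no residue" in the encoded tensor, so unexpected characters in the mutated sequences will not surface as errors.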

diff --git a/README.md b/README.md
@@ -17,8 +17,17 @@ The "examples" folder contains sample bash scripts to obtain EVE scores for the
 The corresponding MSA and ClinVar labels are provided in the data folder.
 
 ## Data requirements
-The only data required to train EVE models and obtain EVE scores from scratch are the multiple sequence alignments (MSAs) for the corresponding proteins (see data/MSA for an example MSA for PTEN). The code provides basic functionalities to pre-process MSAs for modelling. By default, sequences with 50% or more gaps in the alignment and/or positions with less than 70% residue occupancy will be removed. These parameters may be adjusted as needed by the end user (see utils/data_utils.py for more details).
-The script "train_GMM_and_compute_EVE_scores.py" provides functionalities to compare EVE scores with reference labels (e.g., ClinVar) -- these labels are to be provided by the user (using a format similar to the example provided under data/labels).
+The only data required to train EVE models and obtain EVE scores from scratch are the multiple sequence alignments (MSAs) for the corresponding proteins. 
+
+### MSA creation
+We built multiple sequence alignments for each protein family by performing five search iterations of the profile HMM homology search tool Jackhmmer against the UniRef100 database of non-redundant protein sequences (downloaded on April 20th 2020). Please refer to the supplementary notes of the EVE paper (section 3.1.1) for a detailed description of the MSA creation process. 
+Our GitHub repo provides the MSAs for 4 proteins: P53, PTEN, RASH & SCN5A (see data/MSA). MSAs for all proteins may be accessed on our website (https://evemodel.org/).
+
+### MSA pre-processing
+The EVE codebase provides basic functionalities to pre-process MSAs for modelling (see the MSA_processing class in utils/data_utils.py). By default, sequences with 50% or more gaps in the alignment and/or positions with less than 70% residue occupancy will be removed. These parameters may be adjusted as needed by the end user.
+
+### ClinVar labels
+The script "train_GMM_and_compute_EVE_scores.py" provides functionalities to compare EVE scores with reference labels (e.g., ClinVar). We provide labels for 4 proteins: P53, PTEN, RASH & SCN5A (see data/labels). ClinVar labels for all proteins may be accessed on our website (https://evemodel.org/).
 
 ## Software requirements
 The entire codebase is written in python. Package requirements are as follows:
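
The default pre-processing thresholds described in the README hunk above (drop sequences with 50% or more gaps, drop positions with less than 70% residue occupancy) can be illustrated with a small sketch. This is not the `MSA_processing` class from utils/data_utils.py. The function name `filter_msa`, its parameter names, and the choice to compute both filters on the unfiltered alignment are assumptions made for illustration.

```python
import numpy as np

def filter_msa(sequences, max_seq_gap_fraction=0.5,
               min_position_occupancy=0.7, gap="-"):
    """Apply the two default filters to a list of aligned sequences.

    Assumption: both statistics are computed on the original alignment;
    the actual codebase may order or apply these steps differently.
    """
    msa = np.array([list(seq) for seq in sequences])
    is_gap = msa == gap

    # Remove sequences with 50% or more gaps.
    keep_rows = is_gap.mean(axis=1) < max_seq_gap_fraction

    # Remove positions with less than 70% residue occupancy.
    occupancy = 1.0 - is_gap.mean(axis=0)
    keep_cols = occupancy >= min_position_occupancy

    filtered = msa[keep_rows][:, keep_cols]
    return ["".join(row) for row in filtered]

toy_msa = ["ACDE", "AC-E", "A---", "ACDE"]
filtered = filter_msa(toy_msa)
```

In the toy alignment, the third sequence is 75% gaps and is dropped, and the third column has only 50% occupancy and is dropped, leaving three copies of `ACE`.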

diff --git a/compute_evol_indices.py b/compute_evol_indices.py
@@ -52,7 +52,9 @@
     if args.computation_mode=="all_singles":
         data.save_all_singles(output_filename=args.all_singles_mutations_folder + os.sep + protein_name + "_all_singles.csv")
         args.mutations_location = args.all_singles_mutations_folder + os.sep + protein_name + "_all_singles.csv"
-
+    else:
+        args.mutations_location = args.mutations_location + os.sep + protein_name + ".csv"
+
     model_name = protein_name + "_" + args.model_name_suffix
     print("Model name: "+str(model_name))
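
With the new `else` branch, `args.mutations_location` is treated as a folder in both modes, and the per-protein CSV path is derived from `protein_name`. The sketch below restates that path logic as a standalone function. `resolve_mutations_location` is a hypothetical name, and the mode string used in the second call is a placeholder for any value other than `"all_singles"`.

```python
import os

def resolve_mutations_location(computation_mode, protein_name,
                               mutations_location, all_singles_mutations_folder):
    """Return the CSV path that holds the mutations for one protein."""
    if computation_mode == "all_singles":
        # All single mutants are generated and saved to this file first.
        return all_singles_mutations_folder + os.sep + protein_name + "_all_singles.csv"
    # Otherwise, look for <protein_name>.csv inside the user-supplied folder.
    return mutations_location + os.sep + protein_name + ".csv"

singles_path = resolve_mutations_location(
    "all_singles", "PTEN_HUMAN", "mutations", "all_singles")
custom_path = resolve_mutations_location(
    "other_mode", "PTEN_HUMAN", "mutations", "all_singles")
```

A practical implication of the diff is that a caller who previously passed a full file path as the mutations location would now need to pass the containing folder and name the file after the protein.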
 

diff --git a/data/MSA/P53_HUMAN_b0.1.a2m b/data/MSA/P53_HUMAN_b0.1.a2m
diff --git a/data/MSA/RASH_HUMAN_b03.a2m b/data/MSA/RASH_HUMAN_b03.a2m
diff --git a/data/MSA/SCN5A_HUMAN_b1.0.a2m b/data/MSA/SCN5A_HUMAN_b1.0.a2m