Added new examples and updated readme
pascalnotin committed Aug 27, 2021
1 parent efa412b commit aa9589e
Showing 8 changed files with 674,004 additions and 6 deletions.
5 changes: 3 additions & 2 deletions EVE/VAE_model.py
@@ -316,8 +316,9 @@ def compute_evol_indices(self, msa_data, list_mutations_location, num_samples, b
         for i,mutation in enumerate(list_valid_mutations):
             sequence = list_valid_mutated_sequences[mutation]
             for j,letter in enumerate(sequence):
-                k = msa_data.aa_dict[letter]
-                mutated_sequences_one_hot[i,j,k] = 1.0
+                if letter in msa_data.aa_dict:
+                    k = msa_data.aa_dict[letter]
+                    mutated_sequences_one_hot[i,j,k] = 1.0
 
         mutated_sequences_one_hot = torch.tensor(mutated_sequences_one_hot)
         dataloader = torch.utils.data.DataLoader(mutated_sequences_one_hot, batch_size=batch_size, shuffle=False, num_workers=4, pin_memory=True)
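
The effect of the new membership check in this hunk can be illustrated with a small standalone sketch. It is not the repository's code: `ALPHABET`, `AA_DICT`, and `one_hot_encode` are hypothetical names introduced here, and the 20-letter alphabet is an assumption standing in for `msa_data.aa_dict`. Letters that are missing from the dictionary (for example a gap or `X`) no longer raise a `KeyError`; their position simply stays all-zero.

```python
import numpy as np

# Hypothetical 20-letter alphabet; in the commit the mapping comes from msa_data.aa_dict.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
AA_DICT = {aa: idx for idx, aa in enumerate(ALPHABET)}


def one_hot_encode(sequences, aa_dict=AA_DICT):
    """Encode equal-length sequences as an array of shape (num_seqs, seq_len, alphabet_size)."""
    seq_len = len(sequences[0])
    one_hot = np.zeros((len(sequences), seq_len, len(aa_dict)), dtype=np.float32)
    for i, sequence in enumerate(sequences):
        for j, letter in enumerate(sequence):
            # Same guard as in the diff: skip letters the dictionary does not know,
            # leaving that position as an all-zero vector instead of failing.
            if letter in aa_dict:
                one_hot[i, j, aa_dict[letter]] = 1.0
    return one_hot


encoded = one_hot_encode(["ACD", "A-X"])
```

In this sketch the second sequence contributes a single non-zero entry (for `A`); the gap and the `X` are encoded as zeros.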
13 changes: 11 additions & 2 deletions README.md
@@ -17,8 +17,17 @@ The "examples" folder contains sample bash scripts to obtain EVE scores for the
The corresponding MSA and ClinVar labels are provided in the data folder.

## Data requirements
-The only data required to train EVE models and obtain EVE scores from scratch are the multiple sequence alignments (MSAs) for the corresponding proteins (see data/MSA for an example MSA for PTEN). The code provides basic functionalities to pre-process MSAs for modelling. By default, sequences with 50% or more gaps in the alignment and/or positions with less than 70% residue occupancy will be removed. These parameters may be adjusted as needed by the end user (see utils/data_utils.py for more details).
-The script "train_GMM_and_compute_EVE_scores.py" provides functionalities to compare EVE scores with reference labels (e.g., ClinVar) -- these labels are to be provided by the user (using a format similar to the example provided under data/labels).
+The only data required to train EVE models and obtain EVE scores from scratch are the multiple sequence alignments (MSAs) for the corresponding proteins.
+
+### MSA creation
+We built multiple sequence alignments for each protein family by performing five search iterations of the profile HMM homology search tool Jackhmmer against the UniRef100 database of non-redundant protein sequences (downloaded on April 20th, 2020). Please refer to the supplementary notes of the EVE paper (section 3.1.1) for a detailed description of the MSA creation process.
+Our GitHub repo provides the MSAs for 4 proteins: P53, PTEN, RASH & SCN5A (see data/MSA). MSAs for all proteins may be accessed on our website (https://evemodel.org/).
+
+### MSA pre-processing
+The EVE codebase provides basic functionalities to pre-process MSAs for modelling (see the MSA_processing class in utils/data_utils.py). By default, sequences with 50% or more gaps in the alignment and/or positions with less than 70% residue occupancy will be removed. These parameters may be adjusted as needed by the end user.
+
+### ClinVar labels
+The script "train_GMM_and_compute_EVE_scores.py" provides functionalities to compare EVE scores with reference labels (e.g., ClinVar). We provide labels for 4 proteins: P53, PTEN, RASH & SCN5A (see data/labels). ClinVar labels for all proteins may be accessed on our website (https://evemodel.org/).
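
The default thresholds described under "MSA pre-processing" can be sketched as follows. This is a minimal illustration, not the `MSA_processing` class from utils/data_utils.py: the function name `filter_msa`, its parameters, the gap characters, and the order of the two filters (sequences first, then positions computed on the remaining sequences) are all assumptions made for this example.

```python
import numpy as np


def filter_msa(sequences, max_gap_fraction=0.5, min_occupancy=0.7, gap_chars="-."):
    """Drop gappy sequences, then drop poorly occupied alignment positions.

    Hypothetical helper: thresholds follow the README defaults, everything else
    (names, gap characters, filter order) is assumed for illustration.
    """
    msa = np.array([list(seq) for seq in sequences])  # shape: (num_seqs, num_positions)
    is_gap = np.isin(msa, list(gap_chars))

    # Remove sequences with 50% or more gaps.
    keep_rows = is_gap.mean(axis=1) < max_gap_fraction
    msa, is_gap = msa[keep_rows], is_gap[keep_rows]

    # Remove positions with less than 70% residue occupancy.
    occupancy = 1.0 - is_gap.mean(axis=0)
    keep_cols = occupancy >= min_occupancy
    return ["".join(row) for row in msa[:, keep_cols]]


filtered = filter_msa(["ACDE", "A--E", "----", "AC-E"])
```

With this toy input, the second and third sequences are removed (50% and 100% gaps), and the third position is then dropped because only one of the two remaining sequences has a residue there.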

## Software requirements
The entire codebase is written in python. Package requirements are as follows:
4 changes: 3 additions & 1 deletion compute_evol_indices.py
@@ -52,7 +52,9 @@
     if args.computation_mode=="all_singles":
         data.save_all_singles(output_filename=args.all_singles_mutations_folder + os.sep + protein_name + "_all_singles.csv")
         args.mutations_location = args.all_singles_mutations_folder + os.sep + protein_name + "_all_singles.csv"
-
+    else:
+        args.mutations_location = args.mutations_location + os.sep + protein_name + ".csv"
+
     model_name = protein_name + "_" + args.model_name_suffix
     print("Model name: "+str(model_name))
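
The path logic in this hunk can be read as a small pure function. The sketch below is an illustration only: `resolve_mutations_path` is a hypothetical helper that does not exist in the repository, and it uses `os.path.join` where the commit concatenates with `os.sep`. After this change, `mutations_location` is treated as a folder in both modes, and the file name is derived from the protein name.

```python
import os


def resolve_mutations_path(computation_mode, mutations_location, all_singles_folder, protein_name):
    """Return the CSV path the script would read mutations from (hypothetical helper)."""
    if computation_mode == "all_singles":
        # All single mutants are written to, then read back from, this file.
        return os.path.join(all_singles_folder, protein_name + "_all_singles.csv")
    # New in this commit: the user passes a folder and the file name is inferred.
    return os.path.join(mutations_location, protein_name + ".csv")


custom_path = resolve_mutations_path("input_mutations_list", "data/mutations", "data/all_singles", "PTEN_HUMAN")
singles_path = resolve_mutations_path("all_singles", "data/mutations", "data/all_singles", "PTEN_HUMAN")
```

The two forms give the same result as long as the folder argument has no trailing separator; `os.path.join` additionally avoids a doubled separator when it does.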

22,008 changes: 22,008 additions & 0 deletions data/MSA/P53_HUMAN_b0.1.a2m

Large diffs are not rendered by default.

592,332 changes: 592,332 additions & 0 deletions data/MSA/RASH_HUMAN_b03.a2m

Large diffs are not rendered by default.

59,240 changes: 59,240 additions & 0 deletions data/MSA/SCN5A_HUMAN_b1.0.a2m

Large diffs are not rendered by default.
