esm2_masked_lm

Various masked language modeling ideas using ESM-2.

  • masked_lm_clustering shows how to perform hierarchical clustering of latent embeddings of proteins using the masked protein language model ESM-2. Starting from a sequence with one or more masked residues, it computes the top m most likely and least likely protein sequences conditioned on all positions being masked simultaneously. It then clusters the sequences using persistent homology, DBSCAN, and HDBSCAN (along with $k$-Means and Agglomerative Clustering for comparison). HDBSCAN returns a clustering hierarchy reminiscent of an evolutionary tree for the protein sequences generated by the model.

  • ems2_mutations implements part of the paper Language models enable zero-shot prediction of the effects of mutations on protein function, using ESM-2 instead of ESM-1v. See also the Meta repo.

  • scoring_mutations computes the masked_marginal_score, the wild_type_marginal_score, the mutant_type_marginal_score, and the pseudolikelihood_score for a list of mutated sequences predicted to be the most and least likely by ESM-2, based on a fixed wild-type sequence and a fixed target mutation sequence. This is closely related to the previous notebook, and finishes implementing the scoring functions mentioned in Language models enable zero-shot prediction of the effects of mutations on protein function, using ESM-2. You can swap out facebook/esm2_t6_8M_UR50D for one of the larger ESM-2 models.

  • sequence_classification builds a basic protein sequence classifier with three labels: enzymes, receptor proteins, and structural proteins. It uses facebook/esm2_t6_8M_UR50D, the smallest ESM-2 checkpoint, so it is lightweight and easy to train while remaining accurate.

  • residue_classification trains a small residue classifier using facebook/esm2_t6_8M_UR50D to classify residues into three classes: Exposed to Solvent, Binding Site, or Transmembrane Region.
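
As a rough illustration of the masked marginal score used in scoring_mutations, the sketch below sums log p(mutant residue) minus log p(wild-type residue) over the mutated positions. It is not the notebook's code: the function names and the toy logits are hypothetical, and the per-position log-probabilities are assumed to come from a single ESM-2 forward pass with all mutated positions masked.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a dict {residue: logit}."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {residue: v - log_z for residue, v in logits.items()}


def masked_marginal_score(wild_type, mutant, masked_log_probs):
    """Sum of log p(mutant) - log p(wild type) over mutated positions.

    masked_log_probs[i] maps residue -> log-probability at position i.
    Assumption: these values were produced with every mutated position
    masked in the input sequence.
    """
    if len(wild_type) != len(mutant):
        raise ValueError("sequences must have the same length")
    score = 0.0
    for i, (wt, mt) in enumerate(zip(wild_type, mutant)):
        if wt != mt:  # only mutated positions contribute
            position_log_probs = masked_log_probs[i]
            score += position_log_probs[mt] - position_log_probs[wt]
    return score


# Toy example: made-up logits for position 1, where K is mutated to R.
toy_log_probs = {1: log_softmax({"K": 2.0, "R": 1.0, "A": 0.0})}
score = masked_marginal_score("MKT", "MRT", toy_log_probs)
print(score)
```

A negative score means the model considers the mutant residue less likely than the wild-type residue at the masked positions. In practice the log-probabilities would be taken from the model's output logits rather than written by hand.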