
AC Suite - Hydra + Lightning integrated learning for Activity Cliffs (ACs).


The Activity Cliff (AC) Suite


AC Suite is a PyTorch Lightning + Hydra integrated utility to train and evaluate models for molecular property prediction (and beyond!) based on the Matched Molecular Pair (MMP) abstraction. It can be used to obtain embedding vectors that specifically capture the activity cliff relationship between structurally similar molecules with different activities, in order to enhance existing property prediction models.

Built on the lightning-hydra-template, it integrates CLI and Hydra config functionality such as multiruns and parameter sweeps. Extended Connectivity Fingerprint (ECFP) based featurization is done through molfeat's MoleculeTransformer. Evaluation is done with MoleculeACE.
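To make the activity cliff relationship mentioned above concrete, here is a minimal sketch in plain Python. It is not AC Suite, molfeat, or MoleculeACE code: the function names, the toy fingerprints (sets of on-bit indices standing in for ECFP bits), and the 0.9 similarity and 10-fold potency thresholds are all illustrative assumptions.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(bits_a & bits_b) / len(bits_a | bits_b)


def is_activity_cliff(
    bits_a: Set[int],
    bits_b: Set[int],
    potency_a_nm: float,
    potency_b_nm: float,
    min_similarity: float = 0.9,    # assumed threshold, for illustration only
    min_fold_change: float = 10.0,  # assumed threshold, for illustration only
) -> bool:
    """Flag a pair that is structurally similar but differs strongly in potency."""
    similar = tanimoto(bits_a, bits_b) >= min_similarity
    fold_change = max(potency_a_nm, potency_b_nm) / min(potency_a_nm, potency_b_nm)
    return similar and fold_change >= min_fold_change


# Toy fingerprints: 19 shared bits out of 21 distinct bits.
fp_a = set(range(20))
fp_b = set(range(19)) | {42}

similarity = tanimoto(fp_a, fp_b)
cliff_pair = is_activity_cliff(fp_a, fp_b, potency_a_nm=5.0, potency_b_nm=800.0)
non_cliff_pair = is_activity_cliff(fp_a, fp_b, potency_a_nm=5.0, potency_b_nm=20.0)
```

The first pair is flagged because the molecules are nearly identical yet differ 160-fold in potency; the second pair shares the same fingerprints but differs only 4-fold, so it is not flagged.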

Installation

Prerequisites

Using pip

The recommended way to install the AC Suite is through pip.

git clone https://github.com/cmvcordova/acsuite
cd acsuite
pip install .

Current functionality

Embedding pre-training

Pre-training is done as shown in the provided encoder_pretraining.ipynb notebook. It relies on the ACAModule class, a LightningModule subclass, to train models across a variety of tasks and objectives.

Pre-trained encoder extraction and re-training

The HotSwapEncoderMLP optionally takes a pretrained_encoder_ckpt pointing to a model trained with an ACAModule in the previous step. It then extracts and freezes the pretrained encoder, placing it as the input layer of the MLP. The resulting model can then be trained with the provided ACAPPModule for both classification and regression tasks.

Examples of usage are included in the provided moleculeace_training-mlp/acbased.ipynb notebooks.
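The extract-and-freeze idea described in this section can be sketched in plain PyTorch. This is a hypothetical stand-in, not the actual HotSwapEncoderMLP: the class name FrozenEncoderMLP, the layer sizes, and the randomly initialised "pretrained" encoder are assumptions made for illustration, and checkpoint loading is left out.

```python
import torch
from torch import nn


class FrozenEncoderMLP(nn.Module):
    """Hypothetical sketch: a frozen encoder feeding a trainable MLP head."""

    def __init__(self, encoder: nn.Module, latent_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.encoder = encoder
        # Freeze the encoder so the optimizer only updates the head.
        for param in self.encoder.parameters():
            param.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No gradients are needed through the frozen encoder.
        with torch.no_grad():
            z = self.encoder(x)
        return self.head(z)


torch.manual_seed(0)

# Stand-in for an encoder recovered from a pre-training run (assumed shape: 2048-bit input).
pretrained_encoder = nn.Sequential(nn.Linear(2048, 128), nn.ReLU())

model = FrozenEncoderMLP(pretrained_encoder, latent_dim=128, hidden_dim=64, out_dim=1)
batch = torch.rand(4, 2048)
prediction = model(batch)

# Only the head's weights and biases remain trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

With out_dim=1 the head suits a regression target; a classification task would use a different out_dim and loss. In a real workflow the encoder weights would come from the checkpoint referenced by pretrained_encoder_ckpt rather than from random initialisation.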

MoleculeACE evaluation

Evaluation is currently done manually and can use any of the 30 ChEMBL datasets provided by MoleculeACE. See the provided ac/mlp_moleculeace_evaluation.ipynb notebooks for examples of the evaluation functions. Full evaluation coming soon.
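As a rough illustration of what a manual evaluation can compute, the sketch below reports an overall RMSE alongside an RMSE restricted to compounds flagged as activity cliffs. It uses only the standard library; the function names, the toy values, and the is_cliff flags are assumptions, and MoleculeACE's own evaluation utilities should be preferred for real benchmarking.

```python
from math import sqrt
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error over all compounds."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def rmse_on_cliffs(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    is_cliff: Sequence[bool],
) -> float:
    """RMSE computed only on compounds flagged as activity cliffs."""
    pairs = [(t, p) for t, p, flag in zip(y_true, y_pred, is_cliff) if flag]
    if not pairs:
        raise ValueError("no activity cliff compounds in this split")
    return rmse([t for t, _ in pairs], [p for _, p in pairs])


# Toy test split: the model is exact on non-cliff compounds and off by 1.0 on cliffs.
y_true = [6.0, 7.0, 8.0, 5.0]
y_pred = [6.0, 7.0, 7.0, 6.0]
is_cliff = [False, False, True, True]

overall_error = rmse(y_true, y_pred)
cliff_error = rmse_on_cliffs(y_true, y_pred, is_cliff)
```

Reporting both numbers side by side shows whether a model that looks accurate overall is still struggling on activity cliff compounds specifically.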

Data

ACNet and MoleculeACE are the main data sources. ACNet is used for pre-training, and MoleculeACE for downstream training and evaluation:

ACNet

Data must be downloaded and handled by following the guidelines in the ACNet repository. Generation of the datasets must be done by running GenerateACDatasets.py and placing the generated JSON files within the ACNet folder in the main AC Suite data directory.

MoleculeACE

Data is accessible through the AC Suite dataset wrapper. MoleculeACE is a prerequisite for AC Suite; it can otherwise be installed by following the directions in the MoleculeACE repo.

Known Issues

  • The Hadamard and Euclidean distance Siamese classifiers are not thoroughly tested, and training instabilities have been observed. There is likely a mismatch in how they are handled within the HalfStepSiameseEncoder model.
  • Evaluation multiruns are not compatible with the current script, so they are done manually. See Hydra issue #1258.
  • No download method for the pre-training data is implemented; it must be handled manually. See the ACNet section.
  • README could be more informative. Do not hesitate to open issues!

Associated publication(s)

Master's thesis publication: "Towards Learning Activity Cliff-Aware Molecular Representations"
Full paper/Poster accepted at LXAI @ ICML 2024!
Poster accepted at MoML 2024!

How to cite

If you found this useful, please use the following citation:

@inproceedings{ValdezCrdova2024,
  series = {LXAI at ICML 2024},
  title = {Towards Learning Activity Cliff-Aware Molecular Representations},
  url = {http://dx.doi.org/10.52591/lxai202407274},
  DOI = {10.52591/lxai202407274},
  booktitle = {LatinX in AI at International Conference on Machine Learning 2024},
  publisher = {Journal of LatinX in AI Research},
  author = {Valdez Córdova,  César Miguel},
  year = {2024},
  month = jul,
  collection = {LXAI at ICML 2024}
}

Contributing

Open an issue or drop me a line

Known Contributors

yours truly

Acknowledgements

you for the interest and the internet

License

This project is licensed under the MIT License - see the LICENSE file for details.
