AC Suite is a PyTorch Lightning + Hydra integrated utility to train and evaluate models for molecular property prediction (and beyond!) based on the Matched Molecular Pair (MMP) abstraction. It can be used to obtain embedding vectors that specifically capture the activity cliff relationship between structurally similar molecules with different activities, in order to enhance existing property prediction models.
Leveraging the lightning-hydra-template, it integrates CLI and Hydra config functionality such as multiruns and parameter sweeps, among others. Extended Connectivity Fingerprint (ECFP) based featurization is done through molfeat's MoleculeTransformer. Evaluation is done with MoleculeACE.
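To make the activity cliff idea concrete, the following minimal sketch labels a matched molecular pair as a cliff when the two potencies differ by at least a chosen fold change. The `MatchedPair` type, the `is_activity_cliff` helper, the 10-fold default threshold, and the example values are illustrative assumptions, not part of the AC Suite API or its datasets.

```python
from typing import NamedTuple


class MatchedPair(NamedTuple):
    """Two structurally similar molecules with a measured potency each (hypothetical type)."""
    smiles_a: str
    smiles_b: str
    potency_a_nm: float  # potency in nM, e.g. Ki or IC50
    potency_b_nm: float


def is_activity_cliff(pair: MatchedPair, fold_threshold: float = 10.0) -> bool:
    """Return True when the potencies differ by at least `fold_threshold`-fold.

    The 10-fold default is an assumption; use whatever cutoff your dataset defines.
    """
    if pair.potency_a_nm <= 0 or pair.potency_b_nm <= 0:
        raise ValueError("potencies must be positive")
    high = max(pair.potency_a_nm, pair.potency_b_nm)
    low = min(pair.potency_a_nm, pair.potency_b_nm)
    return high / low >= fold_threshold


# Made-up values for illustration only, not real measurements.
cliff_pair = MatchedPair("c1ccccc1O", "c1ccccc1N", 5.0, 800.0)
similar_pair = MatchedPair("c1ccccc1O", "c1ccccc1N", 50.0, 60.0)
labels = [is_activity_cliff(cliff_pair), is_activity_cliff(similar_pair)]
```

Binary labels of this kind are one way to frame the pairwise relationship that the embeddings are meant to capture.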
The recommended way to install AC Suite is through pip:
git clone https://github.com/cmvcordova/acsuite
cd acsuite
pip install .
Pre-training is done as shown in the provided encoder_pretraining.ipynb notebook. It calls on the ACAModule class, derived from a LightningModule, to train models across a variety of tasks and objectives.
The HotSwapEncoderMLP optionally takes a pretrained_encoder_ckpt pointing to a model trained in the previous step with an ACAModule. It then extracts and freezes the pretrained encoder, placing it as the input layer of the MLP. The resulting model can then be trained with the provided ACAPPModule for both classification and regression tasks.
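The extract-and-freeze step can be sketched in plain PyTorch as follows. This is not the HotSwapEncoderMLP implementation: `build_frozen_encoder_mlp`, the layer sizes, and the stand-in encoder are hypothetical, and loading the encoder from a checkpoint is left out because the checkpoint layout is not described here.

```python
import torch
from torch import nn


def build_frozen_encoder_mlp(encoder: nn.Module, encoder_dim: int,
                             hidden_dim: int, out_dim: int) -> nn.Sequential:
    """Freeze `encoder` and place it as the first layer of a small MLP (hypothetical helper)."""
    for param in encoder.parameters():
        param.requires_grad = False  # encoder weights receive no gradient updates
    return nn.Sequential(
        encoder,
        nn.Linear(encoder_dim, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, out_dim),
    )


# Stand-in for an encoder restored from a pre-training checkpoint (sizes are assumptions).
pretrained_encoder = nn.Sequential(nn.Linear(2048, 256), nn.ReLU())
model = build_frozen_encoder_mlp(pretrained_encoder, encoder_dim=256, hidden_dim=64, out_dim=1)

# Only the parameters of the new MLP layers remain trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]

# Forward pass on a dummy batch of four 2048-dimensional fingerprints.
with torch.no_grad():
    output = model(torch.zeros(4, 2048))
```

With `out_dim=1` the head suits regression or binary classification on logits; a multi-class task would use a larger `out_dim`.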
Usage examples are included in the provided moleculeace_training-mlp/acbased.ipynb notebooks.
Evaluation is currently done manually and can access any of the 30 provided MoleculeACE ChEMBL datasets. Check out the provided ac/mlp_moleculeace_evaluation.ipynb notebooks for examples of the evaluation functions. Full evaluation coming soon.
ACNet and MoleculeACE are the main data sources that are used for pre-training and downstream training and evaluation, respectively:
Data must be downloaded and handled by following the guidelines in the ACNet repository. The datasets must be generated by running GenerateACDatasets.py and placing the generated JSON files within the ACNet folder in the main AC Suite data directory.
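Since the ACNet files are placed by hand, a quick check that they landed in the expected folder can save a failed run. The sketch below assumes the layout described above (an `ACNet` folder inside the `data` directory); `find_acnet_json` is a hypothetical helper, not part of AC Suite, and it makes no assumption about the JSON file names.

```python
from pathlib import Path
from typing import List


def find_acnet_json(data_dir: Path) -> List[Path]:
    """Return the JSON files found in `<data_dir>/ACNet`, sorted by path.

    Raises FileNotFoundError when the ACNet folder itself is missing.
    """
    acnet_dir = data_dir / "ACNet"
    if not acnet_dir.is_dir():
        raise FileNotFoundError(f"expected an ACNet folder at {acnet_dir}")
    # Only top-level *.json files are collected; other files are ignored.
    return sorted(acnet_dir.glob("*.json"))
```

Calling `find_acnet_json(Path("data"))` from the repository root returns an empty list when the folder exists but no generated files have been copied yet.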
MoleculeACE data is accessible through the AC Suite dataset wrapper. Installing MoleculeACE is a prerequisite for AC Suite, but it can otherwise be installed by following the directions in the MoleculeACE repo.
- Hadamard and Euclidean distance Siamese classifiers are not thoroughly tested, and training instabilities have been observed. There is likely a mismatch with how they're handled within the HalfStepSiameseEncoder model.
- Evaluation multiruns are not compatible with the current script, so they are done manually. See Hydra issue #1258.
- A pre-training data download method is not implemented, so the data must be handled manually. See the ACNet section.
- README could be more informative. Do not hesitate to open issues!
Master's thesis publication: "Towards Learning Activity Cliff-Aware Molecular Representations"
Full paper/Poster accepted at LXAI @ ICML 2024!
Poster accepted at MoML 2024!
If you found this useful, please use the following citation:
@inproceedings{ValdezCrdova2024,
series = {LXAI at ICML 2024},
title = {Towards Learning Activity Cliff-Aware Molecular Representations},
url = {http://dx.doi.org/10.52591/lxai202407274},
DOI = {10.52591/lxai202407274},
booktitle = {LatinX in AI at International Conference on Machine Learning 2024},
publisher = {Journal of LatinX in AI Research},
author = {Valdez Córdova, César Miguel},
year = {2024},
month = jul,
collection = {LXAI at ICML 2024}
}
Open an issue or drop me a line
Thank you for the interest and the internet
This project is licensed under the MIT License - see the LICENSE file for details.