Last updated: 2024-11-25
This repo contains the code for running synthon-based shape queries, as described in Shape-Aware Synthon Search (SASS) for virtual screening of synthon-based chemical spaces.
SASS is a synthon-based virtual screening method that carries out shape similarity searches in the synthon space instead of the enumerated product space. Queries are fragmented, and synthons are scored against query fragments to prioritize top synthon combinations. A tiny fraction of the full library is then instantiated and rescored, thereby avoiding full enumeration/scoring and significantly accelerating large-scale, shape-based virtual screening.
# clone the git repo
conda create -n <env name> python=3.11
conda activate <env name>
pip install -e .
SASS
uses OpenEye Informatics and Modeling toolkits, which requires a valid OpenEye license. If not set already, set the license file path as an environment variable called OE_LICENSE
.
export OE_LICENSE=...
SASS
uses dask
library for parallelization. In this package, dask
is configured to run on a SLURM cluster or on a local machine. Users should configure dask
for other types of schedulers, and double check the QoS names of their SLURM scheduler.
-
Create an experiment directory to hold the results (and intermediate files).
- Intermediate files can be very large (>200 GB), and will be removed after a run is finished.
-
Save a copy of the config file (
./sass/config_template.yml
to your experiment directory. -
Navigate to your experiment directory.
-
Edit the config file to specify your query molecule, database, etc.
-
Run
sass run --config </path/to/config>
A query run with the full Enamine 2024-02 library (60B products) should take ~1-4 hr on 1000 cpus, depending on the size of the query molecule.
After a run is completed, an SDF file containing the query molecule and the best overlay poses of the top products (rescore_top_<n>_mol_poses.sdf
) will be generated. The 1st molecule is the query molecule, and the rest are NOT sorted in descending ROCS scores (you can sort by scores in MOE/mdb when visualizing).
- synthons: Building blocks that can be directly connected to form products. Synthons are different from reactants in that synthons have labeled connector atoms, and have been modified from the reactants (e.g. leaving groups removed).
- connector atom: Dummy atoms (e.g. U, Np, Pu, Am) used to designate connecting vectors where two compatible synthons join.
- chemical space: All molecules generated from exhaustively instantiating all combinations of compatible synthons.
- query: The input molecule for which shape-similar molecules in the chemical space is to be searched.
- query fragment: Fragments of a query molecule, generated by cleaving bonds of the query.
The overall workflow of SASS is shown below:
-
full_library_enum: Exhaustively enumerate the chemical space defined by the input synthons.
- input: Reaction and synthon files.
- output: Molecule file(s) in
oeb.gz
format.
-
full_library_conf_gen: Generate conformers of the enumerated library using OMEGA. The parameters for OMEGA is defined in the config file.
- input: molecule files.
- output: conformer files in
oez
format.
-
full_library_scoring: Score the conformer database against a given query molecule using ROCS.
- input: conformer files; query molecule.
- output: score list written to file (
pkl
).
-
gen_synthon_conformers: Generate conformers of the synthons, which is different from generating conformers for full-products, because the former involves forming minimal products, conformation sampling, and restoring original connector atoms.
- input: synthon SMILES.
- output: conformer files in
oez
format.
-
score_synthons: Score synthons against query fragments (using ROCS), aggregate individual synthon scores to generate pseudo-product scores, combine the
top_m
pseudo-products from different reactions.dask
: Each dask worker processes the content of onefragment_order
and does synthon scoring and score aggregation. Scores from allfragment_orders
are then merged into one file.-
input: synthon conformers; query fragments.
-
output: combined pseudo-product file for all reactions.
-
NOTE: conceptually, synthon scores are organized in the following hierachy. A valid product can be formed by the combining one synthon from each of the
synthon_set
s under onefragment_order
.{ rxn_id1: { fragment_set1: { fragment_order1: { synthon_set1: [score1, score2, score3 ...], synthon_set2: [...], ... }, ... }, ... }, ... }
rxn_id
: Each reaction in the chemical space.fragment_set
: Each valid fragment set generated by fragmenting the query.fragment_order
: Usually there are multiple ways to order a fragment set to score against a set ofsynthon_set
s. All valids orders of scoring are considered.synthon_set
: A set of synthons belonging to the same component of a multi-component reaction.
-
-
instantiate_products: Instantiate the top pseudo-products. This step cannot be combined with the
rescore_products
step, becauserescore_products
already includesinstantiate_products
. This task is used when the user just wants the instantiatedtop_m
pseudo products and does not need the ROCS scores of those products.- input: Pseudo-products file.
- output: Molecule file(s).
-
rescore_products: Instantiate the top pseudo-products, generate conformers of the instantiated products, score conformers against a given query molecule using ROCS. Scores from all chunks are merged into one file (
top_m
products/scores are kept).- input: Pseudo-products file, query molecule.
- output: Real ROCS scores saved to file; conformer file(s) are kept for
get_product_poses
.
-
get_product_poses: Score top re-scored products against a query molecule using ROCS and write out the best overlay poses for visualization. Note that the
rescore_products
step itself only outputs the numerical scores and product ids, not the poses.- input: Top product conformers, query molecule.
- output: An SDF file containing the query molecule and best overlay poses of top products.
-
For regular SASS queries: run the following tasks:
score_synthons
,rescore_products
,get_product_poses
. If synthon conformers have not be pre-generated, rungen_synthon_conformers
first. If one intends to re-score using a different metric (i.e. not using ROCS), one can run justscore_synthons
andinstantiate_products
to obtain thetop_m
instantiated products. -
To validate SASS performance, the ground truth needs to be generated by running the following tasks:
full_library_enum
,full_library_conf_gen
,full_library_scoring
.- For validation, one only needs to run
score_synthons
(re-scoring is not needed), and the combined pseudo-product scores can be used directly to calculate the recall (comparing to the ground truth).
- For validation, one only needs to run
Given the data volume and independence of most results, parallelization in computing is highly desirable. Currently, parallelization is achieved using dask
for all steps.
-
library product enumeration: Because exhaustive product enumeration from sets of compatible synthons uses the OpenEye
OELibraryGen
object, which itself is likely already quite optimized, it is inefficient to loop through the synthons one-by-one. Instead, for an$n \times m \times k \times ...$ sized library, only the 1st "dimension" is split, i.e. each dask worker only enumerates$n[i: i+c] \times m \times k \times ...$ products.- Tautomer selection and isomer enumeration (FLIPPER) is done on each product.
- The split on the 1st dimension for each reaction is dynamically determined to fit the target product chunk size. e.g. for a two-component reaction of
$50 \times 100$ and a target chunk size of$1000$ , the 1st dimension is split 5-fold with 10 synthons each (i.e. a total of 5 tasks of$10 \times 100 = 1000$ ) - In cases where the actual chunk size exceeds the target chunk size even when the 1st "dimension" is split to the smallest possible chunk (e.g. for a three-component reaction of
$100 \times 100 \times 100$ , even$1 \times 100 \times 100 = 10000$ is larger than the target chunk size of$1000$ ), the oversized chunk is still instantiated on 1 worker, but split to the target chunk size when writing to file. Even thought this does not improve the efficiency of library product instantiation, having even chunk sizes greatly improves the parallelizaiton efficiency for library product conformer generation and scoring. - It is recommended to set the target product chunk size to be at least 1000. Too small chunk size leads to
dask
error. The actual number of isomers per chunk will be ~3000 due to multiple isomers per product.
-
product conformer generation: Each dask worker processes one product chunk (see above).
-
product scoring (ROCS): Each dask worker processes one conformer chunk (see above).
-
synthon conformer generation: Due to uneven size of different reaction synthons (ranging from 1 to 50000) and relatively small number of reactions (~200), parallelization on reaction level leads to very different runtime for each dask worker. To alleviate this, synthon lists are split to a target size and fed to each dask worker. Conformer chunks are written to files, and the head node merges all chunks based on the file name (designating which reaction and which component) at the end. While it could be more efficient to use more workers to merge chunks,
synthon conformer generation
is seldom done, so the shortcoming is acceptable. -
synthon scoring: Scoring is parallelized on the
fragment_order
level. For the full Enamine 2020-09 library with a typical query molecule, the number of totalfragment_order
is on the order of 1e4, which is suitable with ~1000 dask workers. Each task of synthon vs query fragment scoring is not further split. -
singleton product instantiation: Unlike library product enumeration, instantiating the SASS-selected pseudo-products does not require enumeration of all combinations of compatible synthons, but only 1 product from each of the aggregated synthon-id.
OELibraryGen
objects were used, but only with 1 synthon at each reactant position (and 1 expected product per function call).- It is recommended to set the chunk size to be processed by each dask worker to be no smaller than 1000. Too small chunk size leads to
dask
error. The instantiated products are further split into the desired chunk size (usually 10-100/chunk) for downstream rescoring.
- It is recommended to set the chunk size to be processed by each dask worker to be no smaller than 1000. Too small chunk size leads to
-
get_product_poses: Each dask worker looks into each conformer chunk and fetches a subset of conformers based on a set of ids, does overlay with reference molecule, and write out the pose files. The head node then aggregate and deduplicate the poses into one file.
-
Restarting from intermediate steps: SASS writes out and stores the following intermediate files. One can specify the paths to those files/folders to restart from an intermediate step.
- Combined pseudo result file: contains
top_m
synthon combinations to be rescored with ROCS. - Combined rescore result file: contains ROCS scores of
top_m
products, can be used to generate overlay poses. - (Only if
get_product_poses
step has not been run): Conformers oftop_m
products, used to generate overlay poses.
- Combined pseudo result file: contains
-
Using custom color force field files: Users can generate their own color FF file and provide the path under
base_cff_file
key in the config file. This file will override the default Mills-Dean force field used by ROCS. This base color FF will be extended with appropriate connector atom color features.- Users can also re-use an entire set of color FF files by providing the path to the directory under
color_ff_dir
key.
- Users can also re-use an entire set of color FF files by providing the path to the directory under
-
Custom scoring in ROCS: ROCS uses TanimotoCombo score by default. Users can specify the shape and color score functions to use for overlay optimization and scoring by modifying the
[optimization/overlap]_[shape/color]_func
keys in the config file. The parameters must be valid public methods ofoeshape.OEOverlapResults
class.