Shape-Aware Synthon Search (SASS)

Last updated: 2024-11-25

This repo contains the code for running synthon-based shape queries, as described in Shape-Aware Synthon Search (SASS) for virtual screening of synthon-based chemical spaces.

SASS is a synthon-based virtual screening method that carries out shape similarity searches in the synthon space instead of the enumerated product space. Queries are fragmented, and synthons are scored against query fragments to prioritize top synthon combinations. A tiny fraction of the full library is then instantiated and rescored, thereby avoiding full enumeration/scoring and significantly accelerating large-scale, shape-based virtual screening.

Setup

Create a virtual conda environment and install the package

# clone the git repo
conda create -n <env name> python=3.11
conda activate <env name>
pip install -e .

Set OpenEye license path

SASS uses OpenEye Informatics and Modeling toolkits, which requires a valid OpenEye license. If not set already, set the license file path as an environment variable called OE_LICENSE.

export OE_LICENSE=...

Configure compute cluster

SASS uses dask library for parallelization. In this package, dask is configured to run on a SLURM cluster or on a local machine. Users should configure dask for other types of schedulers, and double check the QoS names of their SLURM scheduler.

Usage

Create an experiment directory to hold the results (and intermediate files).
- Intermediate files can be very large (>200 GB), and will be removed after a run is finished.
Save a copy of the config file (./sass/config_template.yml to your experiment directory.
Navigate to your experiment directory.
Edit the config file to specify your query molecule, database, etc.
Run
```
sass run --config </path/to/config>
```

A query run with the full Enamine 2024-02 library (60B products) should take ~1-4 hr on 1000 cpus, depending on the size of the query molecule.

After a run is completed, an SDF file containing the query molecule and the best overlay poses of the top products (rescore_top_<n>_mol_poses.sdf) will be generated. The 1st molecule is the query molecule, and the rest are NOT sorted in descending ROCS scores (you can sort by scores in MOE/mdb when visualizing).

Key terms

synthons: Building blocks that can be directly connected to form products. Synthons are different from reactants in that synthons have labeled connector atoms, and have been modified from the reactants (e.g. leaving groups removed).
connector atom: Dummy atoms (e.g. U, Np, Pu, Am) used to designate connecting vectors where two compatible synthons join.
chemical space: All molecules generated from exhaustively instantiating all combinations of compatible synthons.
query: The input molecule for which shape-similar molecules in the chemical space is to be searched.
query fragment: Fragments of a query molecule, generated by cleaving bonds of the query.

Tasks

The overall workflow of SASS is shown below:

Valid tasks:

full_library_enum: Exhaustively enumerate the chemical space defined by the input synthons.
- input: Reaction and synthon files.
- output: Molecule file(s) in oeb.gz format.
full_library_conf_gen: Generate conformers of the enumerated library using OMEGA. The parameters for OMEGA is defined in the config file.
- input: molecule files.
- output: conformer files in oez format.
full_library_scoring: Score the conformer database against a given query molecule using ROCS.
- input: conformer files; query molecule.
- output: score list written to file (pkl).
gen_synthon_conformers: Generate conformers of the synthons, which is different from generating conformers for full-products, because the former involves forming minimal products, conformation sampling, and restoring original connector atoms.
- input: synthon SMILES.
- output: conformer files in oez format.
score_synthons: Score synthons against query fragments (using ROCS), aggregate individual synthon scores to generate pseudo-product scores, combine the top_m pseudo-products from different reactions. dask: Each dask worker processes the content of one fragment_order and does synthon scoring and score aggregation. Scores from all fragment_orders are then merged into one file.
- input: synthon conformers; query fragments.
- output: combined pseudo-product file for all reactions.
- NOTE: conceptually, synthon scores are organized in the following hierachy. A valid product can be formed by the combining one synthon from each of the synthon_sets under one fragment_order.
```
{
    rxn_id1: {
        fragment_set1: {
            fragment_order1: {
                synthon_set1: [score1, score2, score3 ...],
                synthon_set2: [...],
                ...
            },
            ...
        },
        ...
    },
    ...
}
```
  - rxn_id: Each reaction in the chemical space.
  - fragment_set: Each valid fragment set generated by fragmenting the query.
  - fragment_order: Usually there are multiple ways to order a fragment set to score against a set of synthon_sets. All valids orders of scoring are considered.
  - synthon_set: A set of synthons belonging to the same component of a multi-component reaction.
instantiate_products: Instantiate the top pseudo-products. This step cannot be combined with the rescore_products step, because rescore_products already includes instantiate_products. This task is used when the user just wants the instantiated top_m pseudo products and does not need the ROCS scores of those products.
- input: Pseudo-products file.
- output: Molecule file(s).
rescore_products: Instantiate the top pseudo-products, generate conformers of the instantiated products, score conformers against a given query molecule using ROCS. Scores from all chunks are merged into one file (top_m products/scores are kept).
- input: Pseudo-products file, query molecule.
- output: Real ROCS scores saved to file; conformer file(s) are kept for get_product_poses.
get_product_poses: Score top re-scored products against a query molecule using ROCS and write out the best overlay poses for visualization. Note that the rescore_products step itself only outputs the numerical scores and product ids, not the poses.
- input: Top product conformers, query molecule.
- output: An SDF file containing the query molecule and best overlay poses of top products.

Modes

For regular SASS queries: run the following tasks: score_synthons, rescore_products, get_product_poses. If synthon conformers have not be pre-generated, run gen_synthon_conformers first. If one intends to re-score using a different metric (i.e. not using ROCS), one can run just score_synthons and instantiate_products to obtain the top_m instantiated products.
To validate SASS performance, the ground truth needs to be generated by running the following tasks: full_library_enum, full_library_conf_gen, full_library_scoring.
- For validation, one only needs to run score_synthons (re-scoring is not needed), and the combined pseudo-product scores can be used directly to calculate the recall (comparing to the ground truth).

Parallelization and time/compute resource considerations

Given the data volume and independence of most results, parallelization in computing is highly desirable. Currently, parallelization is achieved using dask for all steps.

library product enumeration: Because exhaustive product enumeration from sets of compatible synthons uses the OpenEye OELibraryGen object, which itself is likely already quite optimized, it is inefficient to loop through the synthons one-by-one. Instead, for an $n \times m \times k \times ...$ sized library, only the 1st "dimension" is split, i.e. each dask worker only enumerates $n[i: i+c] \times m \times k \times ...$ products.
- Tautomer selection and isomer enumeration (FLIPPER) is done on each product.
- The split on the 1st dimension for each reaction is dynamically determined to fit the target product chunk size. e.g. for a two-component reaction of $50 \times 100$ and a target chunk size of $1000$, the 1st dimension is split 5-fold with 10 synthons each (i.e. a total of 5 tasks of $10 \times 100 = 1000$)
- In cases where the actual chunk size exceeds the target chunk size even when the 1st "dimension" is split to the smallest possible chunk (e.g. for a three-component reaction of $100 \times 100 \times 100$, even $1 \times 100 \times 100 = 10000$ is larger than the target chunk size of $1000$), the oversized chunk is still instantiated on 1 worker, but split to the target chunk size when writing to file. Even thought this does not improve the efficiency of library product instantiation, having even chunk sizes greatly improves the parallelizaiton efficiency for library product conformer generation and scoring.
- It is recommended to set the target product chunk size to be at least 1000. Too small chunk size leads to dask error. The actual number of isomers per chunk will be ~3000 due to multiple isomers per product.
product conformer generation: Each dask worker processes one product chunk (see above).
product scoring (ROCS): Each dask worker processes one conformer chunk (see above).
synthon conformer generation: Due to uneven size of different reaction synthons (ranging from 1 to 50000) and relatively small number of reactions (~200), parallelization on reaction level leads to very different runtime for each dask worker. To alleviate this, synthon lists are split to a target size and fed to each dask worker. Conformer chunks are written to files, and the head node merges all chunks based on the file name (designating which reaction and which component) at the end. While it could be more efficient to use more workers to merge chunks, synthon conformer generation is seldom done, so the shortcoming is acceptable.
synthon scoring: Scoring is parallelized on the fragment_order level. For the full Enamine 2020-09 library with a typical query molecule, the number of total fragment_order is on the order of 1e4, which is suitable with ~1000 dask workers. Each task of synthon vs query fragment scoring is not further split.
singleton product instantiation: Unlike library product enumeration, instantiating the SASS-selected pseudo-products does not require enumeration of all combinations of compatible synthons, but only 1 product from each of the aggregated synthon-id. OELibraryGen objects were used, but only with 1 synthon at each reactant position (and 1 expected product per function call).
- It is recommended to set the chunk size to be processed by each dask worker to be no smaller than 1000. Too small chunk size leads to dask error. The instantiated products are further split into the desired chunk size (usually 10-100/chunk) for downstream rescoring.
get_product_poses: Each dask worker looks into each conformer chunk and fetches a subset of conformers based on a set of ids, does overlay with reference molecule, and write out the pose files. The head node then aggregate and deduplicate the poses into one file.

Advanced usage

Restarting from intermediate steps: SASS writes out and stores the following intermediate files. One can specify the paths to those files/folders to restart from an intermediate step.
- Combined pseudo result file: contains top_m synthon combinations to be rescored with ROCS.
- Combined rescore result file: contains ROCS scores of top_m products, can be used to generate overlay poses.
- (Only if get_product_poses step has not been run): Conformers of top_m products, used to generate overlay poses.
Using custom color force field files: Users can generate their own color FF file and provide the path under base_cff_file key in the config file. This file will override the default Mills-Dean force field used by ROCS. This base color FF will be extended with appropriate connector atom color features.
- Users can also re-use an entire set of color FF files by providing the path to the directory under color_ff_dir key.
Custom scoring in ROCS: ROCS uses TanimotoCombo score by default. Users can specify the shape and color score functions to use for overlay optimization and scoring by modifying the [optimization/overlap]_[shape/color]_func keys in the config file. The parameters must be valid public methods of oeshape.OEOverlapResults class.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
config		config
data		data
notebooks		notebooks
sass		sass
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE.txt		LICENSE.txt
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Shape-Aware Synthon Search (SASS)

Setup

Create a virtual conda environment and install the package

Set OpenEye license path

Configure compute cluster

Usage

Key terms

Tasks

Valid tasks:

Modes

Parallelization and time/compute resource considerations

Advanced usage

About

Releases

Packages

Languages

License

Genentech/SASS

Folders and files

Latest commit

History

Repository files navigation

Shape-Aware Synthon Search (SASS)

Setup

Create a virtual conda environment and install the package

Set OpenEye license path

Configure compute cluster

Usage

Key terms

Tasks

Valid tasks:

Modes

Parallelization and time/compute resource considerations

Advanced usage

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages