Multi-Level Featurization of GPCR-Ligand Interaction Patterns and Prediction of Ligand Functions from Selectivity to Biased Activation
In this work, machine learning-based classifiers have been developed to elucidate the diverse interactions inherent in GPCR protein-ligand complexes. To gain deeper insights into the factors influencing these interactions, SHAP (SHapley Additive exPlanations) plots are employed, providing an interpretative framework to understand the contribution of each feature in the machine learning model.
First, download and install Anaconda or Miniconda based on your preference.
Clone or download this GitHub repository to your local machine. If you're familiar with Git, you can clone the repository using the following command:
git clone https://github.com/college-of-pharmacy-gachon-university/GPCR-IPL_Score.git
If you download the ZIP file instead, unzip it into your desired directory.
Navigate to the directory containing the environment.yml file in the terminal (Linux, macOS) or Anaconda Prompt (Windows). Then create the gpcr-ipl conda environment by executing:
conda env create -f environment.yml
This command reads the environment.yml file and installs all specified packages in a new conda environment named gpcr-ipl. Wait for the process to complete, as it may take some time depending on your internet speed and system performance.
Once the environment installation is complete, activate the gpcr-ipl environment using:
conda activate gpcr-ipl
With the gpcr-ipl environment activated, you are ready to run the scripts and notebooks in this repository. Follow the specific instructions provided for running scripts or analyzing your data within this environment.
- Maestro from Schrödinger Suite - Any version (either commercial or free) is accepted. For more information and to obtain Maestro, visit Schrödinger's website.
- KNIME Analytics Platform - Version 4.1.4 recommended. Download it from the KNIME website.
- FuzCav - Available for download at this link. After downloading, unpack the FuzCav package on your local machine.
- ICHEM - Obtainable from here. A license for ICHEM can be requested by contacting Dr. Didier Rognan at [email protected]. After downloading the ICHEM package, follow the provided instructions to install and configure it properly. Set the environment variables for ICHEM using the commands below in your terminal. Replace the paths with the actual locations where you have stored the ICHEM license file and library on your machine:
  export ICHEM_LIC=~/IChem/IChem_gachon.lic
  export ICHEM_LIB=~/IChem/lib
- OpenEye Software - To download, visit the OpenEye Software Download Page. Note that an OpenEye license is required to execute Szybki tasks.
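Before running the ICHEM-based steps, it can help to confirm that the two environment variables are visible to the current session. The following is a minimal sketch, not part of this repository; the function name `check_ichem_env` is hypothetical, and it only checks that each variable is set and points to an existing path.

```python
import os
from pathlib import Path


def check_ichem_env(env=None):
    """Return a status string for each ICHEM variable (hypothetical helper)."""
    env = os.environ if env is None else env
    report = {}
    for name in ("ICHEM_LIC", "ICHEM_LIB"):
        value = env.get(name)
        if not value:
            report[name] = "unset"
        elif Path(value).expanduser().exists():
            # expanduser() resolves a leading "~" if the shell left it in place
            report[name] = "ok"
        else:
            report[name] = "missing path"
    return report


for name, status in check_ichem_env().items():
    print(f"{name}: {status}")
```

A status other than "ok" usually means the `export` commands were run in a different shell session or the paths were not adapted to your machine.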
This procedure outlines the steps necessary to compute optimized features for G-protein-coupled receptors (GPCRs). It involves data collection, preprocessing, and preparation of GPCR complexes.
- Data Acquisition: Retrieve GPCR protein-ligand data from the following sources:
  a) GPCRdb: https://gpcrdb.org/structure
  b) GPCR-EXP: https://zhanggroup.org/GPCR-EXP. From the download section, obtain the superposed GPCRs file named pdb_overlays.tar.gz.
- Data Curation: After acquisition, compare the collected Protein Data Bank (PDB) files for GPCRs and remove redundant entries so that each complex appears only once in the dataset.
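As an illustration of the curation step, the sketch below merges ID lists from the two sources and drops repeats. It assumes that redundancy is judged by PDB ID alone, which the procedure above does not specify; the function name and the example IDs are placeholders, not real dataset entries.

```python
def unique_pdb_ids(*sources):
    """Merge several ID lists, keeping the first occurrence of each PDB ID."""
    seen = set()
    unique = []
    for source in sources:
        for raw in source:
            # PDB IDs are case-insensitive, so normalize before comparing
            pdb_id = raw.strip().lower()
            if pdb_id and pdb_id not in seen:
                seen.add(pdb_id)
                unique.append(pdb_id)
    return unique


# Placeholder IDs for illustration only
gpcrdb_ids = ["1ABC", "2xyz"]
gpcr_exp_ids = ["1abc ", "3DEF"]
print(unique_pdb_ids(gpcrdb_ids, gpcr_exp_ids))
```

If your curation criteria are stricter (for example, same receptor and ligand under different PDB IDs), the comparison key would need to change accordingly.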
- Use the "Protein Preparation Wizard" in the Schrödinger Suite to prepare the GPCR complexes. The preparation includes the following steps:
  -> Repairing any missing atoms in protein structures.
  -> Adding hydrogen atoms to the structures.
  -> Assigning appropriate protonation states to amino acid residues.
  -> Executing a minimization process for the heavy atoms to stabilize the structures.
- Splitting GPCR Complexes: After preparation, split each GPCR complex into two files: {pdbid}_protein.mol2 and {pdbid}_ligand.mol2, where {pdbid} is the PDB ID of the respective complex.
- File Organization: Create a main folder and organize the files as follows:
  {pdbid}/{pdbid}_protein.mol2
  {pdbid}/{pdbid}_ligand.mol2
- Batch Optimization: Run the optimization using the script located in the Processing Folder:
  sh run_szybki.sh
- After optimization, the following two files are used for further processing:
  {pdbid}_opt_7.0_PB.mol2
  {pdbid}_opt_protein.pdb
- Pocket Extraction: From {pdbid}_opt_protein.pdb, extract the binding pocket, taking {pdbid}_opt_7.0_PB.mol2 as the bound ligand and including the residues within 7.0 Å of it. Save the result as {pdbid}_opt_pocket.mol2. This can be done using Maestro or other suitable software.
- File Conversion: Convert {pdbid}_opt_protein.pdb into {pdbid}_opt_protein.mol2.
- Upon completion of Step 3, you should have the following files:
  {pdbid}_opt_7.0_PB.mol2
  {pdbid}_opt_pocket.mol2
  {pdbid}_opt_protein.mol2
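To make the 7.0 Å pocket criterion concrete, here is a minimal standard-library sketch of the selection logic. It is not the method used by Maestro or by this repository: it assumes that a residue belongs to the pocket if any of its atoms lies within the cutoff of any ligand atom, it reads only `ATOM` records with standard fixed-column PDB formatting, and the function names are hypothetical. It returns residue identifiers and does not write a mol2 file.

```python
import math


def mol2_coords(mol2_text):
    """Collect (x, y, z) tuples from the @<TRIPOS>ATOM section of a mol2 file."""
    coords, in_atoms = [], False
    for line in mol2_text.splitlines():
        if line.startswith("@<TRIPOS>"):
            in_atoms = line.strip() == "@<TRIPOS>ATOM"
            continue
        parts = line.split()
        if in_atoms and len(parts) >= 5:
            # mol2 atom line: id, name, x, y, z, type, ...
            coords.append(tuple(float(v) for v in parts[2:5]))
    return coords


def pocket_residues(pdb_text, ligand_coords, cutoff=7.0):
    """Return {(chain, residue number, residue name)} near the ligand."""
    selected = set()
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        # Fixed-column PDB fields: x, y, z occupy columns 31-54
        atom = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        if any(math.dist(atom, lig) <= cutoff for lig in ligand_coords):
            selected.add((line[21], int(line[22:26]), line[17:20].strip()))
    return selected
```

Typical use would be to read `{pdbid}_opt_7.0_PB.mol2` and `{pdbid}_opt_protein.pdb` as text, pass them through these two functions, and compare the resulting residue set with the pocket exported from Maestro as a sanity check.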
- GPCRdb Processing: Use the {pdbid}_opt_protein.pdb file from Step 3 (Job #4) to generate a GPCR generic residue numbering file for each protein with the GPCRdb Generic Numbering Index (https://gpcrdb.org/structure/generic_numbering_index). This produces a {pdbid}_protein_GPCRDB.pdb file.
- Conversion to CSV: Use the Jupyter notebook Convert_GPCRPDBs_to_PandasDataframe.ipynb in the Data Preparation Folder for batch conversion of {pdbid}_protein_GPCRDB.pdb into {pdbid}_protein_GPCRDB.csv.
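For readers who want to see what a PDB-to-CSV conversion involves, the sketch below parses the fixed-column atom records into rows and serializes them as CSV. It is an illustration only and does not reproduce the notebook: the notebook uses pandas, whereas this sketch uses the standard library, and the column names in `FIELDS` are hypothetical. It assumes every atom line carries occupancy and B-factor fields.

```python
import csv
import io

# Hypothetical column names; the notebook's actual CSV layout may differ
FIELDS = ["record", "serial", "atom_name", "res_name", "chain",
          "res_seq", "x", "y", "z", "occupancy", "b_factor"]


def pdb_to_rows(pdb_text):
    """Parse ATOM/HETATM records using the fixed-column PDB layout."""
    rows = []
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue
        rows.append({
            "record": line[0:6].strip(),
            "serial": int(line[6:11]),
            "atom_name": line[12:16].strip(),
            "res_name": line[17:20].strip(),
            "chain": line[21].strip(),
            "res_seq": int(line[22:26]),
            "x": float(line[30:38]),
            "y": float(line[38:46]),
            "z": float(line[46:54]),
            "occupancy": float(line[54:60]),
            "b_factor": float(line[60:66]),
        })
    return rows


def rows_to_csv(rows):
    """Serialize parsed rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Writing `rows_to_csv(pdb_to_rows(text))` to `{pdbid}_protein_GPCRDB.csv` for each input file would give a batch conversion of the same general shape, though the resulting columns are not guaranteed to match what the downstream KNIME workflow expects.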
- Interaction Feature (INT_Feat): Generate using {pdbid}_opt_protein.mol2 and {pdbid}_opt_7.0_PB.mol2 as inputs. The script for batch execution is in the Feature Generation Folder:
  sh INT_Feat_Gen.sh
- Pocket Feature (POCK_Feat): Generate using {pdbid}_opt_pocket.mol2 as input. Execute the following scripts in sequence for batch processing in the Feature Generation Folder:
  sh step1.POCK_Feat_Gen_tagged.sh
  sh step2.POCK_Feat_Gen_listtagged.sh
  sh step3.POCK_Feat_Gen_fp.sh
- Ligand Feature (LIG_Feat): Save all optimized ligands from the GPCR PDBs into an *.sdf file, process them into E3FP fingerprints, and save the result as a CSV file. Use the following script in the Feature Generation Folder for batch execution:
  python LIG_Feat_Gen.py
- Feature Matrix Creation: Use KNIME to create a feature matrix. Required input files:
  a. All {pdbid}_protein_GPCRDB.csv files (from Step 4)
  b. All interaction feature files, *.ifp (from Step 5 (Job #1))
  c. Pocket feature file, *.txt (from Step 5 (Job #2))
  d. Ligand feature file, *.csv (from Step 5 (Job #3))
- Process these inputs using the KNIME workflow located in the Feature Embedding Folder:
  GPCR_KNIME_WORKFLOW.knwf
- This workflow compiles all features into a CSV file suitable for machine learning model development.
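The idea of compiling several feature sets into one matrix can be sketched in a few lines of Python. This is not a translation of the KNIME workflow, whose internal logic is not described here; it assumes one plausible shape, namely an inner join on PDB ID with the feature blocks placed side by side. The function name, the column prefixes, and the example values are all hypothetical.

```python
def build_feature_matrix(feature_tables):
    """Join per-complex feature vectors on PDB ID.

    feature_tables maps a column prefix to {pdbid: [feature values]}.
    Each table is assumed to be non-empty with fixed-length vectors.
    """
    # Keep only complexes that have every feature type (inner join)
    common = set.intersection(*(set(t) for t in feature_tables.values()))
    header = ["pdbid"]
    for prefix, table in feature_tables.items():
        width = len(next(iter(table.values())))
        header += [f"{prefix}_{i}" for i in range(width)]
    rows = []
    for pdbid in sorted(common):
        row = [pdbid]
        for table in feature_tables.values():
            row += list(table[pdbid])
        rows.append(row)
    return header, rows


# Placeholder values for illustration only
tables = {
    "INT": {"1abc": [1, 0], "2xyz": [0, 1]},
    "POCK": {"1abc": [0.5], "2xyz": [0.7], "3def": [0.1]},
}
header, rows = build_feature_matrix(tables)
print(header)
print(rows)
```

An inner join drops any complex that lacks one of the feature types, so a shorter-than-expected matrix is a hint that a feature generation step failed for some PDB IDs.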
- Model Building and Analysis: Train, validate, and test binary and biased activation classification models using the Jupyter Notebook files in the Model Building Folder:
  BINARY_OPT_GPCR_CLASSIFICATION_MODELS.ipynb
  BIASED_ACTIVATION_GPCR_CLASSIFICATION_MODELS.ipynb
- Further analyze features using the SHAP Python library by running the Jupyter Notebooks in the Model Analysis Folder:
  BINARY_OPT_GPCR_CLASSIFICATION_MODELS_SHAP_ANALYSIS.ipynb
  BIASED_ACTIVATION_GPCR_CLASSIFICATION_MODELS_SHAP_ANALYSIS.ipynb
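To give some intuition for what the SHAP analysis reports, the sketch below computes exact Shapley values for a toy model by averaging each feature's marginal contribution over all feature orderings. It does not use the SHAP library and is not the analysis performed in the notebooks; the toy model, the all-zero baseline, and the function name are assumptions chosen for illustration. The brute-force enumeration is only feasible for a handful of features.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by enumerating all feature orderings."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            # Switch feature i from its baseline value to the instance value
            current[i] = instance[i]
            value = predict(current)
            totals[i] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]


def toy_model(x):
    """Hypothetical model with one interaction term (x[0] * x[2])."""
    return 2 * x[0] + 3 * x[1] + x[0] * x[2]


phi = shapley_values(toy_model, [1, 1, 1], [0, 0, 0])
print(phi)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features involved. The SHAP library estimates the same quantities with algorithms that scale to real models.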