Let's start by finding actives for a dataset where we already have known inactives like OPRK1. The other benefit is that there been many newly published OPRK1 structures.
Then, we can simply add our newly found actives, deduplicate them, and likely dock them without even having access to CCU clusters.
Let's use the criteria from GPCR Bench first, they mentioned that they plan to automatically update GPCR bench. I checked. The files appear to be on the DUD-E website but the downloads either fail or are corrupt. Found their github and tried there. The github is working but it is only the the old data from 9 years ago. Maybe worthwhile to email the authors about either the code, but for now - I can at least get the actives data.
I am getting the zip from github, in the meanwhile - the *.smi files contain the whole set post generation. The names are also given in an .id file.
tail -n 201 OPRK_chembl19_set.smi > bench_actives_oprk1.smi
The 201 ligand is the cocrystal ligand.
We need to check overlap with LIT PCBA OPRK1. I moved the .smi file to here as 'litpcba_actives_oprk1.smi'.
Will ICM be able to check identical mols? There might an issue with charge states, as GPCR Bench allowed for multiple protomers and their smiles are enumerated.
icm/def> find table exact litpcba_actives_oprk1 bench_actives_oprk1.mol select 0.5 ( no ? : charge )
Info> 0 hits selected in litpcba_actives_oprk1
icm/def> find table exact bench_actives_oprk1 litpcba_actives_oprk1.mol select 0.5 ( no ? : charge )
Info> 0 hits selected in bench_actives_oprk1
comparison within the set also looks fine
icm/def> find table bench_actives_oprk1 select index = Index( bench_actives_oprk1.mol exact )
Info> 0 hits selected in bench_actives_oprk1
icm/def> find table litpcba_actives_oprk1 select index = Index( litpcba_actives_oprk1.mol exact )
Info> 0 hits selected in litpcba_actives_oprk1
Okay, so, at the least we have another 200 actives.
Since we should already have the data for lit pcba, we just have to prepare these and dock them.
Unfortunately, I think my receptor is not on this laptop?
Either way:
- Rdkit
- Ligprep with no additional protonation
- Dock into grid
- Strain
- merge dfs
- performance metrics
I think it is best to just get my automation scripts back and just get actives for the receptors thomas wanted.
We can return back to OPRK1 and work towards others.
Let's a model a workflow, an initial search of OPRK1
returns this link
There are already a lot of options to chose from. There are choices about species and 'groups' (I would guess this is for non selective molecules). I am not sure exactly how we would want to filter these. Protein species should probably be tied to our PDB availability, but practically - we want to use human PDBs. Selectivity is a bit more nebulous but I would avoid stuff like this:
ChEMBL ID | Name | UniProt Accessions | Type |
---|---|---|---|
CHEMBL2095192 | Opioid receptor | P42866|P32300|P33534 | PROTEIN FAMILY |
Instead we'll look at this:
ChEMBL ID | Name | UniProt Accessions | Type | Organism | Compounds |
---|---|---|---|---|---|
CHEMBL237 | Kappa opioid receptor | P41145 | SINGLE PROTEIN | Homo sapiens | 7926 |
So now, we'll follow the CHEMBL237 link. This is the target card information. It takes a very long time to load.
However, all we really want is probably the bioactivities. We get 14,710 records. How do we filter this down? These are all the options available.
molecule_chembl_id | Molecule ChEMBL ID | string |
---|---|---|
molecule_pref_name | Molecule Name | string |
_metadata.parent_molecule_data.max_phase | Molecule Max Phase | string |
_metadata.parent_molecule_data.full_mwt | Molecular Weight | double |
_metadata.parent_molecule_data.num_ro5_violations | #RO5 Violations | integer |
_metadata.parent_molecule_data.alogp | AlogP | double |
_metadata.parent_molecule_data.compound_key | Compound Key | string |
canonical_smiles | Smiles | string |
standard_type | Standard Type | string |
standard_relation | Standard Relation | string |
standard_value | Standard Value | double |
standard_units | Standard Units | string |
pchembl_value | pChEMBL Value | double |
data_validity_comment | Data Validity Comment | string |
activity_comment | Comment | string |
uo_units | Uo Units | string |
ligand_efficiency.bei | Ligand Efficiency BEI | double |
ligand_efficiency.le | Ligand Efficiency LE | double |
ligand_efficiency.lle | Ligand Efficiency LLE | double |
ligand_efficiency.sei | Ligand Efficiency SEI | double |
potential_duplicate | Potential Duplicate | integer |
assay_chembl_id | Assay ChEMBL ID | string |
assay_description | Assay Description | string |
assay_type | Assay Type | string |
bao_format | BAO Format ID | string |
bao_label | BAO Label | string |
_metadata.assay_data.assay_organism | Assay Organism | string |
_metadata.assay_data.tissue_chembl_id | Assay Tissue ChEMBL ID | string |
_metadata.assay_data.assay_tissue | Assay Tissue Name | string |
_metadata.assay_data.assay_cell_type | Assay Cell Type | string |
_metadata.assay_data.assay_subcellular_fraction | Assay Subcellular Fraction | string |
_metadata.assay_data.assay_parameters | Assay Parameters | object |
assay_variant_accession | Assay Variant Accession | string |
assay_variant_mutation | Assay Variant Mutation | string |
target_chembl_id | Target ChEMBL ID | string |
target_pref_name | Target Name | string |
target_organism | Target Organism | string |
_metadata.target_data.target_type | Target Type | string |
document_chembl_id | Document ChEMBL ID | string |
src_id | Source ID | string |
_metadata.source.src_description | Source Description | string |
document_journal | Document Journal | string |
document_year | Document Year | integer |
_metadata.assay_data.cell_chembl_id | Cell ChEMBL ID | string |
activity_properties | Properties | object |
action_type.action_type | Action Type | string |
And for reference, this is the query if we do need to move to something programmatic:
Index Name
chembl_activity
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "target_chembl_id:CHEMBL237"
}
}
],
"filter": [
{
"query_string": {
"analyze_wildcard": true,
"query": "target_chembl_id:CHEMBL237"
}
}
],
"should": [],
"must_not": []
}
},
"track_total_hits": true,
"size": 20,
"from": 0,
"_source": [
"molecule_chembl_id",
"_metadata.parent_molecule_data.compound_key",
"standard_type",
"standard_relation",
"standard_value",
"standard_units",
"pchembl_value",
"activity_comment",
"assay_chembl_id",
"assay_description",
"bao_label",
"_metadata.assay_data.assay_organism",
"target_chembl_id",
"target_pref_name",
"target_organism",
"_metadata.target_data.target_type",
"document_chembl_id",
"_metadata.source.src_description",
"_metadata.assay_data.cell_chembl_id",
"molecule_pref_name",
"_metadata.parent_molecule_data.max_phase",
"_metadata.parent_molecule_data.full_mwt",
"_metadata.parent_molecule_data.num_ro5_violations",
"_metadata.parent_molecule_data.alogp",
"canonical_smiles",
"data_validity_comment",
"uo_units",
"ligand_efficiency.bei",
"ligand_efficiency.le",
"ligand_efficiency.lle",
"ligand_efficiency.sei",
"potential_duplicate",
"assay_type",
"bao_format",
"_metadata.assay_data.tissue_chembl_id",
"_metadata.assay_data.assay_tissue",
"_metadata.assay_data.assay_cell_type",
"_metadata.assay_data.assay_subcellular_fraction",
"_metadata.assay_data.assay_parameters",
"assay_variant_accession",
"assay_variant_mutation",
"src_id",
"document_journal",
"document_year",
"activity_properties",
"action_type.action_type"
],
"sort": []
}
Here are there filters:
Property | Filter |
---|---|
Ki, Kd, IC50, or EC50 | <= 10,000 nM |
Molecular Weight (MW) | 250 ≤ MW ≤ 500 |
LogP | -2 ≤ LogP ≤ 5 |
Hydrogen bond donors (HBD) | HBD ≤ 4 |
Hydrogen bond acceptors (HBA) | 8 ≤ HBA ≤ 10 |
Rotatable bonds | Rotatable bonds < 8 |
Rings | Rings < 4 |
9-membered rings | 0 |
Isotopes | 0 |
Sulfur (S) atoms | S ≤ 3 |
Chlorine (Cl) atoms | Cl ≤ 3 |
Bromine (Br) atoms | Br ≤ 2 |
Iodine (I) atoms | I ≤ 2 |
Fluorine (F) atoms | F ≤ 7 |
Heteroatoms | Only S, F, Cl, Br, I, N, O |
MW < 250 | LogP < 4.5 |
PAINS | Filter Pains |
Functional groups | "Specific filters apply" |
Not all of this is possible to filter in ChEMBL as far as I can tell. There should be a way to do this relatively easily in ICM. I would probably just try to filter the activity data as such before hand.
After
Property | Filter |
---|---|
Ki, Kd, IC50, or EC50 | <= 10,000 nM |
LogP | -2 ≤ LogP ≤ 5 |
We are already down to 5,452. I grabbed the TSV/CSV now as the site seems unstable. Doing this in ICM without dark mode would burn my eyes out of my skull. I can do that in the morning. The other thing we should filter are the two data sets they say are unreliable.
Here is the query just in case:
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "target_chembl_id:CHEMBL237"
}
}
],
"filter": [
{
"query_string": {
"analyze_wildcard": true,
"query": "target_chembl_id:CHEMBL237"
}
},
{
"bool": {
"should": [
{
"terms": {
"standard_type": [
"Ki",
"EC50",
"IC50"
]
}
}
]
}
},
{
"range": {
"_metadata.parent_molecule_data.alogp": {
"gte": -2,
"lt": 5
}
}
},
{
"range": {
"standard_value": {
"gte": -56,
"lt": "10000"
}
}
}
],
"should": [],
"must_not": []
}
},
"track_total_hits": true,
"size": 20,
"from": 0,
"_source": [
"molecule_chembl_id",
"_metadata.parent_molecule_data.compound_key",
"standard_type",
"standard_relation",
"standard_value",
"standard_units",
"pchembl_value",
"activity_comment",
"assay_chembl_id",
"assay_description",
"bao_label",
"_metadata.assay_data.assay_organism",
"target_chembl_id",
"target_pref_name",
"target_organism",
"_metadata.target_data.target_type",
"document_chembl_id",
"_metadata.source.src_description",
"_metadata.assay_data.cell_chembl_id",
"molecule_pref_name",
"_metadata.parent_molecule_data.max_phase",
"_metadata.parent_molecule_data.full_mwt",
"_metadata.parent_molecule_data.num_ro5_violations",
"_metadata.parent_molecule_data.alogp",
"canonical_smiles",
"data_validity_comment",
"uo_units",
"ligand_efficiency.bei",
"ligand_efficiency.le",
"ligand_efficiency.lle",
"ligand_efficiency.sei",
"potential_duplicate",
"assay_type",
"bao_format",
"_metadata.assay_data.tissue_chembl_id",
"_metadata.assay_data.assay_tissue",
"_metadata.assay_data.assay_cell_type",
"_metadata.assay_data.assay_subcellular_fraction",
"_metadata.assay_data.assay_parameters",
"assay_variant_accession",
"assay_variant_mutation",
"src_id",
"document_journal",
"document_year",
"activity_properties",
"action_type.action_type"
],
"sort": []
}
I can apply this same sort of logic to whatever GPCRs we want. As long as Thomas agrees we should be fine. In the morning we can discuss
- Decoys or known inactives? Both?
- What ratio of actives/inactives?
- What makes a good benchmark? Both a "worst case" and a "best case" seem reasonable. A prospective study definitely warrants a best case.
We can find inactives by doing the opposite property search, greater than the activity cut offs, >10,000 nM at Ki, Kd, IC50, or EC50.
Long Query For OPRK, Ki, Kd (none), IC50, EC50 >10,000 nM
A very annoying thing about ChEMBL is that once you set some filters in their GUI, it locks you from adding new types of filters. So you have to literally clear all your filters and start over. I don't know how we would filter the inactives anyway. Now we can get this into ICM and filter it.
Filtering in ChEMBL directly is infuriating, I downloaded all the data for OPRK, I am just gonna do the whole thing in ICM. This interface is awful.
test