Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add explicit_hydrogen parameter #741

Open
wants to merge 2 commits into
base: interfaces
Choose a base branch
from

Conversation

padix-key
Copy link
Member

This parameter in interface.rdkit.to_mol() defines whether hydrogen should be explicitly or implicitly included in the created Mol

Copy link

codspeed-hq bot commented Jan 24, 2025

CodSpeed Performance Report

Merging #741 will not alter performance

Comparing padix-key:rdkit (167c31a) with main (2624c37)

Summary

✅ 59 untouched benchmarks

mol = EditableMol(Mol())

has_charge_annot = "charge" in atoms.get_annotation_categories()
for i in range(atoms.array_length()):
rdkit_atom = Atom(atoms.element[i].capitalize())
if has_charge_annot:
rdkit_atom.SetFormalCharge(atoms.charge[i].item())
if explicit_hydrogen:
rdkit_atom.SetNoImplicit(True)
Copy link
Contributor

@Croydon-Brixton Croydon-Brixton Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the base case where explicity_hydrogen=True, what would this mean if I just call to_mol for a typical crystal structure that won't have any hydrogens resolved? Will RDKit infer charges / valences in that case? If so this might lead to broken molecules -- if that were the case, should we check that if explicit hydrogen is true there have to be hydrogens in the structure? (I could see this being sth users will try -- at least I would have tried it :D )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Charges would also not be inferred automatically before this PR. But I am not sure what this means for valence: As the bond types are also explicitly set I guess RDKit assumes a radical, if explicit_hydrogen=True, but hydrogen atoms are actually missing. Probably, I should test this.

Having a check there sounds like a good idea, especially as I would agree that this could be a common mistake. However, I also think strictly checking for the simple presence of hydrogen atoms might not be sensible enough, as there are valid molecules without hydrogen atoms, although they appear rarely. What do you think about raising a warning as a 'reminder' to the user to check the input?

@padix-key
Copy link
Member Author

padix-key commented Feb 2, 2025

I checked the different behaviors when an AtomArray without hydrogen is passed to as_mol() with explicit_hydrogen being False/True:

import biotite.interface.rdkit as rdkit_interface
import biotite.structure.info as info
import rdkit.Chem.AllChem as Chem
import rdkit.Chem.Draw as Draw

atoms = info.residue("C")
atoms = atoms[atoms.element != "H"]

mol = rdkit_interface.to_mol(atoms, explicit_hydrogen=True)
Chem.Compute2DCoords(mol)
Draw.MolToFile(mol, "explicit.png")

mol = rdkit_interface.to_mol(atoms, explicit_hydrogen=False)
Chem.Compute2DCoords(mol)
Draw.MolToFile(mol, "non_explicit.png")

explicit.png:

explicit

non_explicit.png:

non_explicit

So rdkit correctly understands that in the latter case it should additional hydrogen atoms to the visualization, as the Mol already contains the hydrogen implicitly.

@padix-key
Copy link
Member Author

I added a warning in case the structure contains no hydrogen atoms. @Croydon-Brixton Could you have a look again?

@@ -141,17 +139,29 @@ def to_mol(
HB3
HXT
"""
mol = EditableMol(Mol())
hydrogen_mask = atoms.element == "H"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I just remembered that I've seen the very occasional structures with deuterium as well (e.g. 1wq2). How would we want to deal with these cases? Do we want to treat this isotope as hydrogen as well (probably chemically most appropriate) or not? If so, the PDB represents deuterium as the element "D"

for i in range(atoms.array_length()):
rdkit_atom = Atom(atoms.element[i].capitalize())
rdkit_atom = Chem.Atom(atoms.element[i].capitalize())
if explicit_hydrogen:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if explicit_hydrogen:
if explicit_hydrogen:
# ... tell RDKit to not assume any implicit hydrogens

if explicit_hydrogen:
if not hydrogen_mask.any():
warnings.warn(
"No hydrogen found in the input, although 'explicit_hydrogen' is 'True'"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"No hydrogen found in the input, although 'explicit_hydrogen' is 'True'"
"No hydrogen found in the input, although 'explicit_hydrogen' is 'True'. "
"This may lead to atoms wrongly interpreted as radicals. Be careful."

rdkit_atom = Atom(atoms.element[i].capitalize())
rdkit_atom = Chem.Atom(atoms.element[i].capitalize())
if explicit_hydrogen:
rdkit_atom.SetNoImplicit(True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm to my mind this option is still problematic, at least given the defaults of the to_mol function:
By default most PDBs won't have any hydrogens present, but the default in to_mol here is explicit_hydrogen=True. This will lead to radicals upon calling Chem.Sanitize, which is undesirable almost always and will probably catch a lot of users (especially those less well versed in RDKit) off guard.

I would suggest:

  1. Changing the default to not expect explicit hydrogens.
  2. Emitting a warning if a user says they're having explicit hydrogens, but no hydrogens are found, with a note that this can lead to radicals. After that, they're on their own ^^

@Croydon-Brixton
Copy link
Contributor

Sorry it wouldn't let me add this to the review summary earlier, but here's my analysis that exhibits the radicals issue:

Our base case (usual PDB files + default options)
Screenshot 2025-02-12 at 11 52 39

The other cases:
Screenshot 2025-02-12 at 11 52 52

Screenshot 2025-02-12 at 11 53 01

Screenshot 2025-02-12 at 11 53 12

Here's the code to reproduce:

import biotite.structure as struc
from biotite.interface.rdkit.mol import to_mol, from_mol
import rdkit.Chem.AllChem as Chem
from typing import *

try:
    # Settings for debugging & interactive tests
    from rdkit.Chem.Draw import IPythonConsole

    IPythonConsole.kekulizeStructures = False
    IPythonConsole.drawOptions.addAtomIndices = False
    IPythonConsole.ipython_3d = False
    IPythonConsole.ipython_useSVG = True
    IPythonConsole.drawOptions.addStereoAnnotation = True
    IPythonConsole.molSize = 300, 200
except ImportError:
    pass

tyr = struc.info.residue("TYR")
tyr.chain_id[:] = "A"
tyr.res_id[:] = 17
tyr_no_h = tyr[tyr.element != "H"]

def has_radicals(mol, return_indices: bool = False) -> Union[bool, Tuple[bool, List[int]]]:
    if mol is None:
        raise ValueError("Input molecule is None")
        
    radical_indices = []
    
    for atom in mol.GetAtoms():
        # Get number of radical electrons
        num_radical_electrons = atom.GetNumRadicalElectrons()
        if num_radical_electrons > 0:
            radical_indices.append(atom.GetIdx())
    
    has_any_radicals = len(radical_indices) > 0
    
    if return_indices:
        return has_any_radicals, radical_indices
    return has_any_radicals

def is_sanitized(mol) -> bool:
    if mol is None:
        return False
        
    try:
        # Create a copy to avoid modifying the input
        mol_copy = Chem.Mol(mol)
        
        # Try to sanitize with all checks enabled
        # catchErrors=True prevents exceptions from being raised
        result = Chem.SanitizeMol(mol_copy, catchErrors=True)
        
        # SANITIZE_NONE (0) means all checks passed
        return result == Chem.SanitizeFlags.SANITIZE_NONE
        
    except Exception:
        return False
    
rad_color = lambda has_rad: '\033[32m' if not has_rad else '\033[31m'  # Green if False, Red if True
san_color = lambda is_san: '\033[32m' if is_san else '\033[31m'      # Green if True, Red if False
reset_color = '\033[0m'

i = 0
for m in [tyr_no_h, tyr]:
    for explicit_hydrogen in [True, False]:
        mol = to_mol(m, explicit_hydrogen=explicit_hydrogen)
        Chem.Compute2DCoords(mol)
        print(f"------- {i}: has_H={'H' in m.element}, {explicit_hydrogen=} -------")
        
        # Color the output based on the conditions
        print("before sanitize")
        has_rad = has_radicals(mol)
        is_san = is_sanitized(mol)
        print(f"{rad_color(has_rad)}has_radicals(mol): {has_rad}{reset_color}")
        print(f"{san_color(is_san)}is_sanitized(mol): {is_san}{reset_color}")
        display(mol);

        Chem.SanitizeMol(mol)
        print("after sanitize")
        has_rad = has_radicals(mol)
        is_san = is_sanitized(mol)
        print(f"{rad_color(has_rad)}has_radicals(mol): {has_rad}{reset_color}")
        print(f"{san_color(is_san)}is_sanitized(mol): {is_san}{reset_color}")
        display(mol);

        print("converted back into biotite structure")
        print(repr(from_mol(mol, conformer_id=0)))

        i += 1

@Croydon-Brixton
Copy link
Contributor

Croydon-Brixton commented Feb 12, 2025

Not directly related to the radicals, but another thing I noted while looking at this:
The from_mol function returns an empty stack when no conformer is given for this test case (indeed there is no 3D conformer, but there is a valid molecule with bonded structure). I think we want users to be able to access the underlying rdkit molecule even in the absence of a conformer -- in my work I needed this type of workflow for example to compute some chemical properties for molecules 'in the abstract'. I suggest to just return 'nan' coordinates for those cases (c.f. my suggested edits in the PR below)

Screenshot 2025-02-12 at 11 55 04

@Croydon-Brixton
Copy link
Contributor

@padix-key I've added my suggested changes here for your convenience: padix-key#13

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants