SMILES property #392

Open
wants to merge 16 commits into
base: develop
19 changes: 19 additions & 0 deletions optimade.rst
@@ -2500,6 +2500,25 @@ chemical\_formula\_anonymous

- A filter that matches an exactly given formula is :filter:`chemical_formula_anonymous="A2B"`.

smiles
merkys marked this conversation as resolved.
~~~~~~

- **Description**: The SMILES (Simplified Molecular Input Line Entry System) representation of the structure.
Member

I think we need a bit more clarification of the expected use.

How "much" of the structure should be described by the SMILES string for it to be valid here (e.g., that it should appear in the results when someone searches for it?) Do we need to require that every "site" in the OPTIMADE structure is present in the SMILES string? Obviously for nperiodic_dimensions=0 and a single molecule this makes sense, same for an nperiodic_dimensions=3 molecular crystal, but what about:

  • co-crystal with two distinct molecules (does SMILES do something fancy for this already?)
  • an inorganic surface with adsorbed molecule
  • a hybrid perovskite structure with molecular unit as a cation

Member Author

> How "much" of the structure should be described by the SMILES string for it to be valid here (e.g., that it should appear in the results when someone searches for it?) Do we need to require that every "site" in the OPTIMADE structure is present in the SMILES string?

This is a good point. I would say that every "site" has to be represented in the SMILES string. There will surely be situations where this is not attainable (e.g., OpenSMILES cannot express polymers, and mixture sites will be difficult to depict). Maybe at this point it would be easier to say that only structures that are "expressible" in OpenSMILES should have smiles, that is, no nonstandard approximations should be made.

> Obviously for nperiodic_dimensions=0 and a single molecule this makes sense, same for an nperiodic_dimensions=3 molecular crystal, but what about:
>
>   • co-crystal with two distinct molecules (does SMILES do something fancy for this already?)

SMILES can contain many distinct molecules; disconnected components are joined with `.` (if I understand the question correctly).

>   • an inorganic surface with adsorbed molecule
>   • a hybrid perovskite structure with molecular unit as a cation

I would say these two fall under the class "polymer" and are thus inexpressible.

Contributor

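> co-crystal with two distinct molecules (does SMILES do something fancy for this already?)

This can be described using a "dot bond", e.g. CuSO4.O.O (written schematically; as valid SMILES this would be something like `[Cu+2].[O-]S(=O)(=O)[O-].O.O`).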
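
As a rough illustration of the dot notation discussed above, here is a minimal Python sketch that splits a SMILES string at its dots. The function name `written_fragments` is hypothetical, and the split is purely textual: under OpenSMILES a ring-closure digit can bond atoms across a dot (`C1.C1` is ethane), so the pieces are not guaranteed to be separate molecules.

```python
def written_fragments(smiles: str) -> list[str]:
    """Split a SMILES string at its dots (plain text split, no chemistry).

    Hypothetical helper for illustration only. The pieces are the
    dot-separated parts as written, which need not be disconnected
    molecules (e.g. "C1.C1" describes ethane).
    """
    return smiles.split(".")


# Sodium chloride written as two dot-separated ions.
print(written_fragments("[Na+].[Cl-]"))
```

Deciding whether the parts are truly disconnected requires parsing the ring-closure digits, which is a job for a cheminformatics toolkit rather than string handling.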

- **Type**: string
- **Requirements/Conventions**:

  - **Support**: OPTIONAL support in implementations, i.e., MAY be :val:`null`.
  - **Query**: Support for queries on this property is OPTIONAL.
    Queries MUST treat the value of this property as a raw string, without SMILES-specific semantics.
    That is, providers MUST NOT perform substructure search, just regular string comparison.
Comment on lines +2512 to +2513
Contributor

Suggested change
Queries MUST treat the value of this property as a raw string, without SMILES-specific semantics.
That is, providers MUST NOT perform substructure search, just regular string comparison.

A molecule can have hundreds of valid SMILES strings. To determine whether a particular molecule is present in the database, a client would have to include all of them in a query, and I can imagine that such a query would be slow to execute.
A more efficient approach would be to convert the SMILES string in the query into a structure and then back into a SMILES string, using the same method that was used to generate the SMILES strings in the database.
These lines, however, explicitly forbid databases from implementing this method.
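
To make the concern concrete, here is a minimal Python sketch of what a client would have to do under raw string comparison: list every spelling it wants to match and join them with `OR`. It assumes the property is exposed under the name `smiles`, as proposed in this PR, and that filter strings escape backslashes and double quotes with a backslash. The helper names are hypothetical.

```python
def quote_filter_string(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and double quotes.

    Escaping matters here because SMILES uses "\\" for cis/trans bonds.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def smiles_any_of(variants: list[str]) -> str:
    """Build a filter that matches any of the given SMILES spellings exactly."""
    return " OR ".join(f"smiles={quote_filter_string(v)}" for v in variants)


# Three valid spellings of ethanol; as raw strings they are all different.
ethanol_spellings = ["CCO", "OCC", "C(O)C"]
print(smiles_any_of(ethanol_spellings))
# prints: smiles="CCO" OR smiles="OCC" OR smiles="C(O)C"
```

The filter grows linearly with the number of spellings, and it still misses any spelling the client did not think of, which is the limitation raised in this comment.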

Member Author

Contributor

@JPBergsma, are you OK with leaving these two lines intact and marking the conversation as resolved?

From what I understand from the discussions in #392, it was agreed to implement the complex structure search functionality in a different way (e.g. by using SMARTS).

  - Value MUST adhere to the `OpenSMILES specification v1.0 <http://opensmiles.org/opensmiles.html>`__.
  - When structures or their parts cannot be unambiguously represented in SMILES according to OpenSMILES recommendations, using the guidelines from `Quirós et al. 2018 <https://doi.org/10.1186/s13321-018-0279-6>`__ is RECOMMENDED.
  - Providers MAY canonicalize (i.e., use rules to establish a stable order of atoms) the SMILES representations they produce, but this is not mandatory.
    Generally, providers SHOULD NOT change the representation more frequently than the structure itself is modified.
merkys marked this conversation as resolved.

- **Examples**:

  - caffeine: `CN1C=NC2=C1C(=O)N(C(=O)N2C)C`
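
For readers who want to try the proposed property, the following Python sketch shows how a client might turn the caffeine example into a query URL. The base URL `https://example.org/optimade/v1` and the helper `structures_query_url` are hypothetical, and the sketch assumes the provider supports queries on `smiles`, which is OPTIONAL. Because matching is by raw string comparison, only entries stored with exactly this spelling would be returned.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical provider base URL, used for illustration only.
BASE_URL = "https://example.org/optimade/v1"


def structures_query_url(base_url: str, smiles: str) -> str:
    """Build a /structures URL filtering on an exact SMILES string."""
    # Escape backslashes and double quotes inside the filter string.
    escaped = smiles.replace("\\", "\\\\").replace('"', '\\"')
    filter_expr = f'smiles="{escaped}"'
    # urlencode percent-encodes characters such as '=', '"', '(' and ')'.
    return f"{base_url}/structures?{urlencode({'filter': filter_expr})}"


url = structures_query_url(BASE_URL, "CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
print(url)

# Decoding the query string recovers the original filter expression.
print(parse_qs(urlsplit(url).query)["filter"][0])
```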

dimension\_types
~~~~~~~~~~~~~~~~
