Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SMILES data type #436

Open
wants to merge 15 commits into
base: develop
Choose a base branch
from

Conversation

merkys
Copy link
Member

@merkys merkys commented Dec 7, 2022

In #368 SMILES property for structures was proposed. PR #392 was proposed to introduce string-valued SMILES property, however, there were opinions that internal semantics of SMILES strings have to be respected, what does not fit nicely with "plain string" data type. Thus the current PR introduces both smiles data type and an associated property.

Fixes #368.

@merkys merkys added type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus. PR/requires-discussion labels Dec 7, 2022
@merkys merkys mentioned this pull request Dec 7, 2022
Equality comparisons ('=' and '!=') MUST be supported for SMILES values.
When handling equality comparisons of SMILES values, an implementation SHOULD NOT regard them as simple strings.
Instead, an implementation SHOULD either compare the described chemical structures or canonicalize SMILES representations and then perform direct string matching.
In addition to equality comparison operators, :val:`CONTAINS` MAY be supported optionally as an operator to check whether one structure is a substructure of another.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, this line is not entirely clear. Does this mean that we support querying for chemical groups ? As would be defined with SMARTS query language ? In that case we should probably mention the SMARTS query language.
Or do we only support searching for whole molecules in the SMILES string which could be separated by a "."?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, this deserves some clarification. I would not introduce SMARTS yet, but it is worth explaining what smiles CONTAINS "c1ccccc1" means.

When I was putting this PR together, I was thinking about substructure search. That is, "c1ccccc1" would as well be found in fluorobenzene. But we may limit ourselves to complete match of whole molecular entities (i.e., parts of SMILES separated by .). Which use would have better cost/benefit ratio?

optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
Copy link
Contributor

@JPBergsma JPBergsma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this nice PR.
Apart from the two small comments, it looks good to me.

optimade.rst Outdated Show resolved Hide resolved
@vaitkus
Copy link
Contributor

vaitkus commented Dec 22, 2022

The introduction of the SMILES data type will most likely also require some changes similar to those introduced in PR #444 for the timestamp data type.

@merkys
Copy link
Member Author

merkys commented Dec 30, 2022

The introduction of the SMILES data type will most likely also require some changes similar to those introduced in PR #444 for the timestamp data type.

Fair point. However, I am not sure JSON Schema supports custom formats, I need to have a better look at it.

@ml-evs
Copy link
Member

ml-evs commented Dec 30, 2022

The introduction of the SMILES data type will most likely also require some changes similar to those introduced in PR #444 for the timestamp data type.

Fair point. However, I am not sure JSON Schema supports custom formats, I need to have a better look at it.

It does, see: https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-validation-01#section-7.2

Lots of impenetrable waffle in the spec about it though -- not clear to me how we would announce what "format": "smiles" actually means within JSON Schema yet. We could at least include a regex for structural validation.

@rartino
Copy link
Contributor

rartino commented Dec 30, 2022

The introduction of the SMILES data type will most likely also require some changes similar to those introduced in PR #444 for the timestamp data type.

Fair point. However, I am not sure JSON Schema supports custom formats, I need to have a better look at it.

It does, see: https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-validation-01#section-7.2

Lots of impenetrable waffle in the spec about it though -- not clear to me how we would announce what "format": "smiles" actually means within JSON Schema yet. We could at least include a regex for structural validation.

Reading your link:


Implementations MAY support custom format attributes. Save for agreement between parties, schema authors SHALL NOT expect a peer implementation to support such custom format attributes. An implementation MUST NOT fail to collect unknown formats as annotations. When the Format-Assertion vocabulary is specified, implementations MUST fail upon encountering unknown formats. Vocabularies do not support specifically declaring different value sets for keywords. Due to this limitation, and the historically uneven implementation of this keyword, it is RECOMMENDED to define additional keywords in a custom vocabulary rather than additional format attributes if interoperability is desired.


So, if I read this correctly, they are saying: "But, don't use this, use your own field instead", e.g., x-optimade-type?

The third (IMO very inelegant) option is to specify the format as "regex" + the best JSON Schema compatible regex we can come up with for SMILES strings. Implementations would then have to recognize specifically that regex and reverse-map it into the knowledge that the field is of OPTIMADE SMILES-type. However, in that case I think x-optimade-type is about 1e10 times better as a solution.

A happy side note: it appears the JSON Schema regex situation is more clear now than when we looked into it last time. They now define a "least common denominator subset" regex standard.

https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-01#name-regular-expressions

To me, this seems to finally provide a solution to the regular expression dilemma discussions (#42 #160). We should now simply follow their lead and standardize this both for our filter language and for format: regex in property definitions (which presently are not allowed at all). In contrast to JSON Schema, our property definitions should say that the regex MUST only use the subset.

@rartino rartino mentioned this pull request Dec 30, 2022
@merkys
Copy link
Member Author

merkys commented Jan 17, 2023

So, if I read this correctly, they are saying: "But, don't use this, use your own field instead", e.g., x-optimade-type?

The third (IMO very inelegant) option is to specify the format as "regex" + the best JSON Schema compatible regex we can come up with for SMILES strings. Implementations would then have to recognize specifically that regex and reverse-map it into the knowledge that the field is of OPTIMADE SMILES-type. However, in that case I think x-optimade-type is about 1e10 times better as a solution.

x-optimade-type sounds reasonable to me. Should I open a separate PR to add x-optimade-type to the specification or should I lump it together with this one?

@rartino
Copy link
Contributor

rartino commented Jan 17, 2023

x-optimade-type sounds reasonable to me. Should I open a separate PR to add x-optimade-type to the specification or should I lump it together with this one?

Lets take it in the open PR on property definitions; I have another thing that should be adjusted in the specification there as well, moving the identifier to "$ID". If you want to hurry things along, feel free to do a PR against my branch for that PR, or don't and I'll add x-optimade-type as I add '$ID'.

@merkys
Copy link
Member Author

merkys commented Jan 17, 2023

Lets take it in the open PR on property definitions; I have another thing that should be adjusted in the specification there as well, moving the identifier to "$ID". If you want to hurry things along, feel free to do a PR against my branch for that PR, or don't and I'll add x-optimade-type as I add '$ID'.

I am OK to wait. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR/requires-discussion type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add SMILES property
5 participants