-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SMILES data type #436
base: develop
Are you sure you want to change the base?
SMILES data type #436
Conversation
Equality comparisons ('=' and '!=') MUST be supported for SMILES values. | ||
When handling equality comparisons of SMILES values, an implementation SHOULD NOT regard them as simple strings. | ||
Instead, an implementation SHOULD either compare the described chemical structures or canonicalize SMILES representations and then perform direct string matching. | ||
In addition to equality comparison operators, :val:`CONTAINS` MAY be supported optionally as an operator to check whether one structure is a substructure of another. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, this line is not entirely clear. Does this mean that we support querying for chemical groups ? As would be defined with SMARTS query language ? In that case we should probably mention the SMARTS query language.
Or do we only support searching for whole molecules in the SMILES string which could be separated by a "."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, this deserves some clarification. I would not introduce SMARTS yet, but it is worth explaining what smiles CONTAINS "c1ccccc1"
means.
When I was putting this PR together, I was thinking about substructure search. That is, "c1ccccc1"
would as well be found in fluorobenzene. But we may limit ourselves to complete match of whole molecular entities (i.e., parts of SMILES separated by .
). Which use would have better cost/benefit ratio?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this nice PR.
Apart from the two small comments, it looks good to me.
Co-authored-by: Antanas Vaitkus <[email protected]>
Co-authored-by: Antanas Vaitkus <[email protected]>
The introduction of the SMILES data type will most likely also require some changes similar to those introduced in PR #444 for the timestamp data type. |
Fair point. However, I am not sure JSON Schema supports custom formats, I need to have a better look at it. |
It does, see: https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-validation-01#section-7.2 Lots of impenetrable waffle in the spec about it though -- not clear to me how we would announce what |
Reading your link:
So, if I read this correctly, they are saying: "But, don't use this, use your own field instead", e.g., The third (IMO very inelegant) option is to specify the format as "regex" + the best JSON Schema compatible regex we can come up with for SMILES strings. Implementations would then have to recognize specifically that regex and reverse-map it into the knowledge that the field is of OPTIMADE SMILES-type. However, in that case I think A happy side note: it appears the JSON Schema regex situation is more clear now than when we looked into it last time. They now define a "least common denominator subset" regex standard. https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-01#name-regular-expressions To me, this seems to finally provide a solution to the regular expression dilemma discussions (#42 #160). We should now simply follow their lead and standardize this both for our filter language and for format: regex in property definitions (which presently are not allowed at all). In contrast to JSON Schema, our property definitions should say that the regex MUST only use the subset. |
…add much of a benefit for the specification.
|
Lets take it in the open PR on property definitions; I have another thing that should be adjusted in the specification there as well, moving the identifier to "$ID". If you want to hurry things along, feel free to do a PR against my branch for that PR, or don't and I'll add |
I am OK to wait. Thanks. |
…o SMILES-data-type
Co-authored-by: Johan Bergsma <[email protected]>
In #368 SMILES property for structures was proposed. PR #392 was proposed to introduce string-valued SMILES property, however, there were opinions that internal semantics of SMILES strings have to be respected, what does not fit nicely with "plain string" data type. Thus the current PR introduces both
smiles
data type and an associated property.Fixes #368.