Define OPTIMADE regex format #490

Merged
merged 32 commits into from
Mar 22, 2024
Commits
bc34a81
Define OPTIMADE regex format
rartino Jan 6, 2024
2b4c4d1
Fix capitalization on JSON Schema
rartino Jan 6, 2024
ce9d0a1
Minor grammar corrections
rartino Jan 7, 2024
1952025
Clarify anchored and unanchored RE sentence
rartino Jan 14, 2024
f289191
Clarify processing variables in ECMA standard
rartino Jan 14, 2024
37b45f4
Fix links to ECMA standard
rartino Jan 16, 2024
7b462ec
Merge branch 'develop' into regex_format
rartino Jan 17, 2024
ef9c93c
Change punctuation/quotation style to adhere to the "logical" convention
rartino Jan 18, 2024
2207051
Clarify text based on suggestions in review
rartino Feb 9, 2024
6035fde
Adjust regex description WIP check compatibility
rartino Feb 12, 2024
991272d
Merge branch 'develop' into regex_format
rartino Feb 12, 2024
80aa0e6
Improve definition of regex format; be more precise about escapes in …
rartino Feb 14, 2024
c84566d
Remove unnecessary constraint on class order.
rartino Feb 14, 2024
fc411a4
Fix minor grammar error
rartino Feb 14, 2024
6d3e5a2
Capitalize Unicode
rartino Feb 15, 2024
3736ebd
Clean-up and made more stringent
rartino Feb 16, 2024
864747c
Re-revert mistankenly reverted logical quoting style.
rartino Feb 16, 2024
c20ad03
Minor formulation adjustments, correcting rst linebreaks
rartino Feb 16, 2024
fb2e2d3
Remove trailing whitespace
rartino Feb 16, 2024
dcffe75
"tokes" → "tokens" ?
sauliusg Mar 21, 2024
9532e75
Merge branch 'develop' into regex_format
rartino Mar 21, 2024
610a7c1
Apply suggestions from review
rartino Mar 22, 2024
1315ad8
Restructuring to remove repetitions and improve order
rartino Mar 22, 2024
680d4e3
Missing oxford comma
rartino Mar 22, 2024
ec8d7a7
Fix rst formatting issue
rartino Mar 22, 2024
9bc0135
Fix reference to JSON Schema standard before it is discussed
rartino Mar 22, 2024
bda92b6
Formatting fix
rartino Mar 22, 2024
95698f6
Minor text improvement
rartino Mar 22, 2024
cba1227
Delete trailing whitespace
rartino Mar 22, 2024
188b2f6
Fix rendering issue in compatibility notes
rartino Mar 22, 2024
24b811b
Fix unicode character reference
rartino Mar 22, 2024
30e58d8
Merge branch 'develop' into regex_format
rartino Mar 22, 2024
28 changes: 28 additions & 0 deletions optimade.rst
@@ -3963,3 +3963,31 @@ An example of the sparse layout for multidimensional lists with three aggregated
[3,7,19, ["PARTIAL-DATA-REF", ["https://example.db.org/value2"]]]
[4,5,19, [ [11, 110], ["PARTIAL-DATA-REF", ["https://example.db.org/value3"]], [550, 333]]]
["PARTIAL-DATA-END", [""]]

OPTIMADE Regular Expression Format
----------------------------------
This section defines a string representation for regular expressions (regexes) to be referred to from other parts of the specification.
This format is referred to as an "OPTIMADE regex".
Depending on the context, a delimiter may be required to enclose the regex (e.g., double quotes or a slash character), and some outer-level escape rules may apply (e.g., to distinguish an enclosing double quote from one that is part of the regex).
Such delimiters and escape rules are not defined as part of the OPTIMADE regex format itself and have to be clarified when this format is referenced.
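
As an illustration of such outer-level escaping, the following minimal sketch embeds a regex in a JSON string. The regex itself is a hypothetical example, and JSON is only one possible enclosing context; Python's standard ``json`` module is used here purely for demonstration.

```python
import json

# Hypothetical regex containing double quotes and a backslash escape.
regex = r'^"[a-z]+\.[a-z]+"$'

# JSON encloses the string in double quotes and applies its own escape rules:
# each inner double quote and each backslash gets an extra backslash.
encoded = json.dumps(regex)
print(encoded)

# Decoding removes the outer layer and recovers the regex unchanged.
decoded = json.loads(encoded)
print(decoded == regex)
```

The regex is the decoded value; the extra backslashes and enclosing quotes belong to the JSON layer, not to the OPTIMADE regex format.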

The format is a subset of the format described in `ECMA-262, section 21.2.1 <https://262.ecma-international.org/11.0/#sec-patterns>`__.
The format is closely inspired by the subset recommended in the JSON Schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 <https://json-schema.org/draft/2020-12/json-schema-core#section-6.4>`__.
However, OPTIMADE restricts the subset further to better align it with the features available in common database backends, and clarifies that the escape character token can be used with the meaning defined by ECMA-262, section 21.2.1.

Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 <https://262.ecma-international.org/11.0/#sec-patterns>`__.
The regex is interpreted according to the processing rules that apply for an expression where, of all the variables set by the RegExp internal slot described by `ECMA-262, section 21.2.2.1 <https://262.ecma-international.org/11.0/#sec-notation>`__, only the Unicode variable is set to true.
Furthermore, it can only use the following tokens and features (this list is partially quoted from the JSON Schema standard):

- Individual Unicode characters, as defined by the `JSON specification <https://json-schema.org/draft/2020-12/json-schema-core#RFC8259>`__.
- The escape character (``\``) with the functionality described in `ECMA-262, section 21.2.1 <https://262.ecma-international.org/11.0/#sec-patterns>`__.
- Simple character classes (e.g., ``[abc]``) and range character classes (e.g., ``[a-z]``).
- Complemented character classes (e.g., ``[^abc]``, ``[^a-z]``).
- Simple quantifiers: ``+`` (one or more), ``*`` (zero or more), ``?`` (zero or one).
- The beginning-of-input (``^``) and end-of-input (``$``) anchors.
- Simple grouping (``(...)``) and alternation (``|``).

Note that compared to the JSON Schema standard, lazy quantifiers (``+?``, ``*?``, ``??``) are NOT included, nor are range quantifiers (``{x}``, ``{x,y}``, ``{x,}``).
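
The exclusion of lazy and range quantifiers can be checked mechanically. The sketch below is a hypothetical helper (``find_excluded_quantifiers`` is not part of any OPTIMADE library) that only looks for these two quantifier families; it is not a full validator of the format. It assumes that braces inside character classes and inside escapes such as ``\{``, ``\u{...}``, and ``\p{...}`` are not quantifiers.

```python
def find_excluded_quantifiers(pattern: str) -> list[str]:
    """Return the lazy or range quantifiers found in ``pattern``.

    Hypothetical helper for illustration: it detects only the two quantifier
    families excluded above and does not validate the rest of the pattern.
    """
    found = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # Skip the escaped character; braced escapes are skipped whole.
            if pattern[i + 1 : i + 3] in ("u{", "p{", "P{"):
                end = pattern.find("}", i)
                i = len(pattern) if end == -1 else end + 1
            else:
                i += 2
            continue
        if in_class:
            # Inside [...] every unescaped character except ']' is literal.
            in_class = ch != "]"
        elif ch == "[":
            in_class = True
        elif ch in "+*?" and pattern[i + 1 : i + 2] == "?":
            # A quantifier directly followed by '?' is a lazy quantifier.
            found.append(ch + "?")
            i += 2
            continue
        elif ch == "{":
            # A bare '{' outside a class starts a range quantifier.
            end = pattern.find("}", i)
            end = len(pattern) - 1 if end == -1 else end
            found.append(pattern[i : end + 1])
            i = end + 1
            continue
        i += 1
    return found


print(find_excluded_quantifiers("^[a-z]+(foo|bar)?$"))
print(find_excluded_quantifiers("a+?b{2,3}"))
```

An empty result means only that neither excluded quantifier family was found; the pattern may still use other features outside the allowed list.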

The expression matches the string at any position unless it contains a leading beginning-of-input (``^``) or trailing end-of-input (``$``) anchor listed above, i.e., the anchors are not implicitly assumed.
For example, the OPTIMADE regex "es" matches "expression".
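
The unanchored behaviour can be sketched with Python's ``re`` module standing in for an ECMA-262 engine. This is an assumption that holds for the simple patterns shown here; the two dialects differ in other respects (for instance, Python's ``$`` also matches just before a trailing newline). The function name ``optimade_regex_matches`` is hypothetical.

```python
import re


def optimade_regex_matches(pattern: str, value: str) -> bool:
    """Unanchored search: the pattern may match anywhere in ``value``."""
    return re.search(pattern, value) is not None


# No anchors: "es" is found inside "expression".
print(optimade_regex_matches("es", "expression"))

# A leading ^ restricts the match to the start of the string.
print(optimade_regex_matches("^es", "expression"))
print(optimade_regex_matches("^ex", "expression"))

# With both anchors the whole string has to match.
print(optimade_regex_matches("^es$", "expression"))
print(optimade_regex_matches("^es$", "es"))
```

Using ``re.search`` rather than ``re.fullmatch`` is what makes the anchors explicit: an implementation that wants whole-string matching has to be given ``^...$`` in the pattern itself.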