From bc34a8162449c5304c13a14eb0038dd2402db0dd Mon Sep 17 00:00:00 2001
From: Rickard Armiento
Date: Sat, 6 Jan 2024 14:37:16 +0100
Subject: [PATCH 01/28] Define OPTIMADE regex format

---
 optimade.rst | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/optimade.rst b/optimade.rst
index 948143bc2..7ecc6cdc4 100644
--- a/optimade.rst
+++ b/optimade.rst
@@ -3960,3 +3960,29 @@ An example of the sparse layout for multidimensional lists with three aggregated
 [3,7,19, ["PARTIAL-DATA-REF", ["https://example.db.org/value2"]]]
 [4,5,19, [ [11, 110], ["PARTIAL-DATA-REF", ["https://example.db.org/value3"]], [550, 333]]]
 ["PARTIAL-DATA-END", [""]]
+
+OPTIMADE Regular Expression Format
+----------------------------------
+This section defines a string representation for regular expressions (regexes) to be referred to from other parts of the specification.
+This format will be referred to as an "OPTIMADE regex."
+Depending on the context, a delimiter may be required to enclose the regex (e.g., double quotes or a slash character), and some outer-level escape rule may apply (e.g., to distinguish an enclosing double quote from one that is part of the regex).
+Such delimiters and escape rules are not defined as part of the OPTIMADE regex format itself and has to be clarified when this format is referenced.
+
+The format is a subset of the format described in `ECMA-262, section 21.2.1 `__.
+The format is closely inspired by the subset recommended in the JSON schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__.
+However, OPTIMADE has decided to restrict the subset further to better align with the features available in common database backends and clarified that the escape character token can be used with the meaning defined by ECMA-262, section 21.2.1.
+
+Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__ (when processed with Unicode support) that only uses the following tokens and features (this list is partially quoted from the JSON Schema standard):
+
+- Individual Unicode characters, as defined by the `JSON specification `__.
+- The escape character (``\``) with the functionality described in `ECMA-262, section 21.2.1 `__.
+- Simple character classes (e.g., ``[abc]``) and range character classes (e.g., ``[a-z]``).
+- Complemented character classes (e.g., ``[^abc]``, ``[^a-z]``)
+- Simple quantifiers: ``+`` (one or more), ``*`` (zero or more), ``?`` (zero or one).
+- The beginning-of-input (``^``) and end-of-input (``$``) anchors.
+- Simple grouping (``(...)``) and alternation (``|``).
+
+Note that compared to the JSON Schema standard, lazy quantifiers (``+?``, ``*?``, ``??``) are NOT included, nor are range quantifiers (``{x}``, ``{x,y}``, ``{x,}``).
+
+Unless explicitly anchored, the expression is evaluated unanchored.
+For example, the OPTIMADE regex "es" matches "expression".

From 2b4c4d1a6e3afd6b061ead1899c12f2fb086a618 Mon Sep 17 00:00:00 2001
From: Rickard Armiento
Date: Sat, 6 Jan 2024 14:50:29 +0100
Subject: [PATCH 02/28] Fix capitalization on JSON Schema

---
 optimade.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/optimade.rst b/optimade.rst
index 7ecc6cdc4..f1203af38 100644
--- a/optimade.rst
+++ b/optimade.rst
@@ -3969,7 +3969,7 @@ Depending on the context, a delimiter may be required to enclose the regex (e.g.
 Such delimiters and escape rules are not defined as part of the OPTIMADE regex format itself and has to be clarified when this format is referenced.
 
 The format is a subset of the format described in `ECMA-262, section 21.2.1 `__.
-The format is closely inspired by the subset recommended in the JSON schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__.
+The format is closely inspired by the subset recommended in the JSON Schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__.
 However, OPTIMADE has decided to restrict the subset further to better align with the features available in common database backends and clarified that the escape character token can be used with the meaning defined by ECMA-262, section 21.2.1.
 
 Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__ (when processed with Unicode support) that only uses the following tokens and features (this list is partially quoted from the JSON Schema standard):

From ce9d0a1125f9c47f24a98d7415bd48e6035dcd87 Mon Sep 17 00:00:00 2001
From: Rickard Armiento
Date: Mon, 8 Jan 2024 00:46:42 +0100
Subject: [PATCH 03/28] Minor grammar corrections

Co-authored-by: Antanas Vaitkus
---
 optimade.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/optimade.rst b/optimade.rst
index f1203af38..f263e6c4a 100644
--- a/optimade.rst
+++ b/optimade.rst
@@ -3965,12 +3965,12 @@ OPTIMADE Regular Expression Format
 ----------------------------------
 This section defines a string representation for regular expressions (regexes) to be referred to from other parts of the specification.
 This format will be referred to as an "OPTIMADE regex."
-Depending on the context, a delimiter may be required to enclose the regex (e.g., double quotes or a slash character), and some outer-level escape rule may apply (e.g., to distinguish an enclosing double quote from one that is part of the regex).
-Such delimiters and escape rules are not defined as part of the OPTIMADE regex format itself and has to be clarified when this format is referenced.
+Depending on the context, a delimiter may be required to enclose the regex (e.g., double quotes or a slash character), and some outer-level escape rules may apply (e.g., to distinguish an enclosing double quote from one that is part of the regex).
+Such delimiters and escape rules are not defined as part of the OPTIMADE regex format itself and have to be clarified when this format is referenced.
 
 The format is a subset of the format described in `ECMA-262, section 21.2.1 `__.
 The format is closely inspired by the subset recommended in the JSON Schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__.
-However, OPTIMADE has decided to restrict the subset further to better align with the features available in common database backends and clarified that the escape character token can be used with the meaning defined by ECMA-262, section 21.2.1.
+However, OPTIMADE has decided to restrict the subset further to better align it with the features available in common database backends and clarified that the escape character token can be used with the meaning defined by ECMA-262, section 21.2.1.
 
 Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__ (when processed with Unicode support) that only uses the following tokens and features (this list is partially quoted from the JSON Schema standard):
 

From 19520256ed384bb270cd034207b5c5de3ee22fe5 Mon Sep 17 00:00:00 2001
From: Rickard Armiento
Date: Mon, 15 Jan 2024 00:04:07 +0100
Subject: [PATCH 04/28] Clarify anchored and unanchored RE sentence

---
 optimade.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/optimade.rst b/optimade.rst
index f263e6c4a..5bf384ed3 100644
--- a/optimade.rst
+++ b/optimade.rst
@@ -3984,5 +3984,5 @@ Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, sect
 
 Note that compared to the JSON Schema standard, lazy quantifiers (``+?``, ``*?``, ``??``) are NOT included, nor are range quantifiers (``{x}``, ``{x,y}``, ``{x,}``).
 
-Unless explicitly anchored, the expression is evaluated unanchored.
+The expression matches the string at any position unless it contains a leading beginning-of-input (``^``) or trailing end-of-input (``$``) anchor listed above, i.e., the anchors are not implicitly assumed.
 For example, the OPTIMADE regex "es" matches "expression".

From f289191852a9b14dd51cab02d229d063f64fc326 Mon Sep 17 00:00:00 2001
From: Rickard Armiento
Date: Mon, 15 Jan 2024 00:40:50 +0100
Subject: [PATCH 05/28] Clarify processing variables in ECMA standard

---
 optimade.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/optimade.rst b/optimade.rst
index 5bf384ed3..e8b470a57 100644
--- a/optimade.rst
+++ b/optimade.rst
@@ -3972,7 +3972,9 @@ The format is a subset of the format described in `ECMA-262, section 21.2.1 `__.
 However, OPTIMADE has decided to restrict the subset further to better align it with the features available in common database backends and clarified that the escape character token can be used with the meaning defined by ECMA-262, section 21.2.1.
 
-Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__ (when processed with Unicode support) that only uses the following tokens and features (this list is partially quoted from the JSON Schema standard):
+Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__.
+The regex is interpreted according to the processing rules that apply for an expression where only the Unicode variable is set to true of all variables set by the RegExp internal slot described by [Sec. 21.2.2.1](https://262.ecma-international.org/11.0/#sec-notation).
+Furthermore, it can only use the following tokens and features (this list is partially quoted from the JSON Schema standard):
 
 - Individual Unicode characters, as defined by the `JSON specification `__.
 - The escape character (``\``) with the functionality described in `ECMA-262, section 21.2.1 `__.
From 37b45f4250f754b726183bec3175f68e37a4ab91 Mon Sep 17 00:00:00 2001
From: Rickard Armiento
Date: Tue, 16 Jan 2024 11:25:40 +0100
Subject: [PATCH 06/28] Fix links to ECMA standard

Co-authored-by: Antanas Vaitkus
---
 optimade.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/optimade.rst b/optimade.rst
index e8b470a57..e77fa6293 100644
--- a/optimade.rst
+++ b/optimade.rst
@@ -3968,16 +3968,16 @@ This format will be referred to as an "OPTIMADE regex."
 Depending on the context, a delimiter may be required to enclose the regex (e.g., double quotes or a slash character), and some outer-level escape rules may apply (e.g., to distinguish an enclosing double quote from one that is part of the regex).
 Such delimiters and escape rules are not defined as part of the OPTIMADE regex format itself and have to be clarified when this format is referenced.
 
-The format is a subset of the format described in `ECMA-262, section 21.2.1 `__.
+The format is a subset of the format described in `ECMA-262, section 21.2.1 `__.
 The format is closely inspired by the subset recommended in the JSON Schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__.
 However, OPTIMADE has decided to restrict the subset further to better align it with the features available in common database backends and clarified that the escape character token can be used with the meaning defined by ECMA-262, section 21.2.1.
 
-Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__.
-The regex is interpreted according to the processing rules that apply for an expression where only the Unicode variable is set to true of all variables set by the RegExp internal slot described by [Sec. 21.2.2.1](https://262.ecma-international.org/11.0/#sec-notation).
+Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__.
+The regex is interpreted according to the processing rules that apply for an expression where only the Unicode variable is set to true of all variables set by the RegExp internal slot described by `ECMA-262, section 21.2.2.1 `__.
 Furthermore, it can only use the following tokens and features (this list is partially quoted from the JSON Schema standard):
 
 - Individual Unicode characters, as defined by the `JSON specification `__.
-- The escape character (``\``) with the functionality described in `ECMA-262, section 21.2.1 `__.
+- The escape character (``\``) with the functionality described in `ECMA-262, section 21.2.1 `__.
 - Simple character classes (e.g., ``[abc]``) and range character classes (e.g., ``[a-z]``).
 - Complemented character classes (e.g., ``[^abc]``, ``[^a-z]``)
 - Simple quantifiers: ``+`` (one or more), ``*`` (zero or more), ``?`` (zero or one).

From ef9c93c7902a7fe348c69419d01d5c2fecc2ddab Mon Sep 17 00:00:00 2001
From: Rickard Armiento
Date: Thu, 18 Jan 2024 11:37:31 +0100
Subject: [PATCH 07/28] Change punctuation/quotation style to adhere to the "logical" convention

Co-authored-by: Matthew Evans <7916000+ml-evs@users.noreply.github.com>
---
 optimade.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/optimade.rst b/optimade.rst
index 9d91a7f6c..82bfb187a 100644
--- a/optimade.rst
+++ b/optimade.rst
@@ -3967,7 +3967,7 @@ An example of the sparse layout for multidimensional lists with three aggregated
 OPTIMADE Regular Expression Format
 ----------------------------------
 This section defines a string representation for regular expressions (regexes) to be referred to from other parts of the specification.
-This format will be referred to as an "OPTIMADE regex."
+This format will be referred to as an "OPTIMADE regex".
 Depending on the context, a delimiter may be required to enclose the regex (e.g., double quotes or a slash character), and some outer-level escape rules may apply (e.g., to distinguish an enclosing double quote from one that is part of the regex).
 Such delimiters and escape rules are not defined as part of the OPTIMADE regex format itself and have to be clarified when this format is referenced.
 

From 220705193544884b5c40d72ff809ddf50b92f13b Mon Sep 17 00:00:00 2001
From: Rickard Armiento
Date: Fri, 9 Feb 2024 16:46:38 +0100
Subject: [PATCH 08/28] Clarify text based on suggestions in review

---
 optimade.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/optimade.rst b/optimade.rst
index 82bfb187a..96cdb7e06 100644
--- a/optimade.rst
+++ b/optimade.rst
@@ -3975,7 +3975,7 @@ The format is a subset of the format described in `ECMA-262, section 21.2.1 `__.
 However, OPTIMADE has decided to restrict the subset further to better align it with the features available in common database backends and clarified that the escape character token can be used with the meaning defined by ECMA-262, section 21.2.1.
 
-Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__.
+Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__ with the additional restrictions described in the following.
 The regex is interpreted according to the processing rules that apply for an expression where only the Unicode variable is set to true of all variables set by the RegExp internal slot described by `ECMA-262, section 21.2.2.1 `__.
 Furthermore, it can only use the following tokens and features (this list is partially quoted from the JSON Schema standard):
 

From 6035fde1beb1f93633a9758b54c0a3b77375f534 Mon Sep 17 00:00:00 2001
From: Rickard Armiento
Date: Mon, 12 Feb 2024 13:38:54 +0100
Subject: [PATCH 09/28] Adjust regex description WIP check compatibility

---
 optimade.rst | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/optimade.rst b/optimade.rst
index 96cdb7e06..6acfbca70 100644
--- a/optimade.rst
+++ b/optimade.rst
@@ -3968,19 +3968,22 @@ OPTIMADE Regular Expression Format
 ----------------------------------
 This section defines a string representation for regular expressions (regexes) to be referred to from other parts of the specification.
 This format will be referred to as an "OPTIMADE regex".
-Depending on the context, a delimiter may be required to enclose the regex (e.g., double quotes or a slash character), and some outer-level escape rules may apply (e.g., to distinguish an enclosing double quote from one that is part of the regex).
-Such delimiters and escape rules are not defined as part of the OPTIMADE regex format itself and have to be clarified when this format is referenced.
+Depending on the context in which an OPTIMADE regex appear, a delimiter may be required to enclose the regex (e.g., double quotes or a slash character) and some outer-level escape rules may apply.
+Such delimiters and outer escape rules are not a part of the OPTIMADE regex format itself and have to be clarified when this format is referenced.
+For example, if the string representing the regex is serialized as part of JSON data, the JSON escape rules for strings apply to distinguish an enclosing double quote from one that is part of the regex string.
+The format documented here applies to the string after the JSON data has been deserialised.
 
 The format is a subset of the format described in `ECMA-262, section 21.2.1 `__.
 The format is closely inspired by the subset recommended in the JSON Schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__.
-However, OPTIMADE has decided to restrict the subset further to better align it with the features available in common database backends and clarified that the escape character token can be used with the meaning defined by ECMA-262, section 21.2.1.
+However, OPTIMADE has decided to restrict the subset further to better align it with the features available in common database backends and clarified which characters can be escaped.
+The intent is that the specified format also is a subset of the `PCRE2 regex format `__, making the format directly useful without translation in a wide range of regex implementations.
 
 Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__ with the additional restrictions described in the following.
 The regex is interpreted according to the processing rules that apply for an expression where only the Unicode variable is set to true of all variables set by the RegExp internal slot described by `ECMA-262, section 21.2.2.1 `__.
 Furthermore, it can only use the following tokens and features (this list is partially quoted from the JSON Schema standard):
 
 - Individual Unicode characters, as defined by the `JSON specification `__.
-- The escape character (``\``) with the functionality described in `ECMA-262, section 21.2.1 `__.
+- A literal escape of one of the syntax characters or the character ``/``, i.e., the escape character (``\``) followed by one of the following characters ``^ $ \ . * + ? ( ) [ ] { } | /`` to represent that literal character.
 - Simple character classes (e.g., ``[abc]``) and range character classes (e.g., ``[a-z]``).
 - Complemented character classes (e.g., ``[^abc]``, ``[^a-z]``)
 - Simple quantifiers: ``+`` (one or more), ``*`` (zero or more), ``?`` (zero or one).
@@ -3988,6 +3991,23 @@ Furthermore, it can only use the following tokens and features (this list is par
 - Simple grouping (``(...)``) and alternation (``|``).
 
 Note that compared to the JSON Schema standard, lazy quantifiers (``+?``, ``*?``, ``??``) are NOT included, nor are range quantifiers (``{x}``, ``{x,y}``, ``{x,}``).
+Furthermore, there is no support for character classes shorthands via the backslash character ``\`` and a letter, nor is there a way to represent a unicode character by its code point.
 
 The expression matches the string at any position unless it contains a leading beginning-of-input (``^``) or trailing end-of-input (``$``) anchor listed above, i.e., the anchors are not implicitly assumed.
 For example, the OPTIMADE regex "es" matches "expression".
+
+OPTIMADE regexes that utilizes tokes and features documented by ECMA-262 beyond the designated subset is allowed to have an undefined behavior,i.e., it MAY match or not match any string, or MAY produce an error.
+Implementations that do not produce errors in this situation are RECOMMENDED to generate warnings if possible.
+
+  Compatibility notes:
+
+  Since the specification tolerates regexes using tokens and features beyond the defined subset (with undefined behavior), a regex can be directly handed over to an internal regex engine as long as it is compatible with the defined subset without need for validation or translation.
+  Compatibility with other regex formats may change between language versions and options provided to the respective implementation.
+  However, using third-party sources, e.g., the [Regular Expression Engine Comparison Chart](https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816)), we have collected the following information as a general guide [TODO WIP: check this]:
+
+  * The following regex formats appear to be compatible: ECMAScript, PCRE (both v1 and v2), POSIX ERE, Python, Ruby, Tcl ARE, Java, .NET, MySQL, XPath, `MongoDB `__, `MS SQL Server `__, `Oracle `__, `IBM Db2 `__, `Elasticsearch `__, `Snowflake `__, `Splunk `__, `DuckDB `__.
+
+  * XML Schema regexes are compatible except that they are implicitly anchored: i.e., the beginning-of-input ``^`` and end-of-input ``$`` anchors must be removed, and missing anchors replaced by ``.*``.
+  * SQLite supports regexes via libraries and thus can use a compatible format (e.g., PCRE2).
+  * Basic POSIX regular expressions requires grouping to be escaped, i.e. ``\(``, ``\)``.
+  * Rust regexes are compatible except they do not recognize ``\/`` for a literal ``/``, which would have to be translated into just a single ``/`` for a literal match.

From 80aa0e6eb027eb3765b37c3f3117aab8a172a1df Mon Sep 17 00:00:00 2001
From: Rickard Armiento
Date: Wed, 14 Feb 2024 17:29:56 +0100
Subject: [PATCH 10/28] Improve definition of regex format; be more precise about escapes in character classes

---
 optimade.rst | 53 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/optimade.rst b/optimade.rst
index 92ed981df..fb51990a2 100644
--- a/optimade.rst
+++ b/optimade.rst
@@ -3967,48 +3967,61 @@ An example of the sparse layout for multidimensional lists with three aggregated
 
 OPTIMADE Regular Expression Format
 ----------------------------------
-This section defines a string representation for regular expressions (regexes) to be referred to from other parts of the specification.
-This format will be referred to as an "OPTIMADE regex".
-Depending on the context in which an OPTIMADE regex appear, a delimiter may be required to enclose the regex (e.g., double quotes or a slash character) and some outer-level escape rules may apply.
-Such delimiters and outer escape rules are not a part of the OPTIMADE regex format itself and have to be clarified when this format is referenced.
-For example, if the string representing the regex is serialized as part of JSON data, the JSON escape rules for strings apply to distinguish an enclosing double quote from one that is part of the regex string.
-The format documented here applies to the string after the JSON data has been deserialised.
+This section defines a unicode string representation of regular expressions (regexes) to be referenced from other parts of the specification.
+The format will be referred to as an "OPTIMADE regex".
+
+Regexes are commonly embedded in a contexts where they need to be enclosed by delimiters (e.g., double quotes or slash characters).
+If this is the case, it is likely that some outer-level escape rules apply to allow the end delimiter to appear within the regex.
+Such delimiters and escape rules are *not* included in the definition of the OPTIMADE regex format itself and needs to be clarified when this format is referenced.
+The format defined in this section applies after such outer escape rules have been applied (e.g., when all occurences of ``\/`` have been translated into ``/`` for a format where an unescaped slash character is the end delimiter).
+Likewise, if an OPTIMADE regex is embedded in a serialized data format (e.g., JSON) this section documents the format of the unicode string resulting from unserialization of that format.
 
 The format is a subset of the format described in `ECMA-262, section 21.2.1 `__.
 The format is closely inspired by the subset recommended in the JSON Schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__.
-However, OPTIMADE has decided to restrict the subset further to better align it with the features available in common database backends and clarified which characters can be escaped.
-The intent is that the specified format also is a subset of the `PCRE2 regex format `__, making the format directly useful without translation in a wide range of regex implementations.
+However, OPTIMADE has decided to restrict the subset further to better align it with the features available in common database backends and to clarify the limitations of character classes and character escapes.
+The intent is that the specified format also is a subset of the `PCRE2 regex format `__ to make the format directly useful (without translation) in a wide range of regex implementations.
 
 Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__ with the additional restrictions described in the following.
 The regex is interpreted according to the processing rules that apply for an expression where only the Unicode variable is set to true of all variables set by the RegExp internal slot described by `ECMA-262, section 21.2.2.1 `__.
 Furthermore, it can only use the following tokens and features (this list is partially quoted from the JSON Schema standard):
 
 - Individual Unicode characters, as defined by the `JSON specification `__.
-- A literal escape of one of the syntax characters or the character ``/``, i.e., the escape character (``\``) followed by one of the following characters ``^ $ \ . * + ? ( ) [ ] { } | /`` to represent that literal character.
-- Simple character classes (e.g., ``[abc]``) and range character classes (e.g., ``[a-z]``).
-- Complemented character classes (e.g., ``[^abc]``, ``[^a-z]``)
+- A literal escape of one of the syntax characters, i.e., the escape character (``\``) followed by one of the following characters ``^ $ \ . * + ? ( ) [ ] { } |`` to represent that literal character.
+- Simple character classes (e.g., ``[abc]``) and range character classes (e.g., ``[a-z]``) with the following constraints:
+
+  * The class has to be ordered so that it does not start with the character ``[``.
+  * If the first character is ``]`` it designates a class that includes a literal ``]`` (and not an empty class).
+    The ``]`` character cannot appear anywhere else in the class.
+  * The character ``-`` designates ranges unless it appears at the start or end of the class.
+  * A literal ``\`` is represented by an escaped backslash ``\\``.
+  * Except for as specified above, all characters represent themselves literally (including syntax characters).
+  * Each literal character can appear in the class at most once.
+
+- Complemented character classes (e.g., ``[^abc]``, ``[^a-z]``).
 - Simple quantifiers: ``+`` (one or more), ``*`` (zero or more), ``?`` (zero or one).
+  (These can only appear directly after a character, group, or character class.)
 - The beginning-of-input (``^``) and end-of-input (``$``) anchors.
 - Simple grouping (``(...)``) and alternation (``|``).
 
 Note that compared to the JSON Schema standard, lazy quantifiers (``+?``, ``*?``, ``??``) are NOT included, nor are range quantifiers (``{x}``, ``{x,y}``, ``{x,}``).
-Furthermore, there is no support for character classes shorthands via the backslash character ``\`` and a letter, nor is there a way to represent a unicode character by its code point.
+Furthermore, there is no support for character class shorthands via the backslash character ``\`` and a letter, nor is there a way to represent a unicode character by its code point (i.e., one has to include it as the literal unicode character).
 
-The expression matches the string at any position unless it contains a leading beginning-of-input (``^``) or trailing end-of-input (``$``) anchor listed above, i.e., the anchors are not implicitly assumed.
+An OPTIMADE regex matches the string at any position unless it contains a leading beginning-of-input (``^``) or trailing end-of-input (``$``) anchor listed above, i.e., the anchors are not implicitly assumed.
 For example, the OPTIMADE regex "es" matches "expression".
 
-OPTIMADE regexes that utilizes tokes and features documented by ECMA-262 beyond the designated subset is allowed to have an undefined behavior,i.e., it MAY match or not match any string, or MAY produce an error.
+Regexes that utilizes tokes and features documented by ECMA-262 beyond the designated subset are allowed to have an undefined behavior,i.e., they MAY match or not match *any* string, or MAY produce an error.
 Implementations that do not produce errors in this situation are RECOMMENDED to generate warnings if possible.
 
   Compatibility notes:
 
-  Since the specification tolerates regexes using tokens and features beyond the defined subset (with undefined behavior), a regex can be directly handed over to an internal regex engine as long as it is compatible with the defined subset without need for validation or translation.
-  Compatibility with other regex formats may change between language versions and options provided to the respective implementation.
-  However, using third-party sources, e.g., the [Regular Expression Engine Comparison Chart](https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816)), we have collected the following information as a general guide [TODO WIP: check this]:
+  The specification tolerates (with undefined behavior) regexes using tokens and features beyond the defined subset.
+  Hence, a regex can be directly handed over to any internal regex engine that is compatible with the defined subset without need for validation or translation.
+  Compatibility with other regex formats may change between language versions and options.
+  As a general guide we have used third-party sources, e.g., the `Regular Expression Engine Comparison Chart `__ to collect the following information:
 
-  * The following regex formats appear to be compatible: ECMAScript, PCRE (both v1 and v2), POSIX ERE, Python, Ruby, Tcl ARE, Java, .NET, MySQL, XPath, `MongoDB `__, `MS SQL Server `__, `Oracle `__, `IBM Db2 `__, `Elasticsearch `__, `Snowflake `__, `Splunk `__, `DuckDB `__.
+  * `ECMAScript `__ and `PCRE `__ (both v1 and v2) are meant to be compatible by design.
 
+  * The following regex formats appear to be compatible: `Perl `__, `POSIX ERE `__, `Python `__, `Ruby `__, `Rust `__, `Java `__, `.NET `__, `MySQL `__, `MongoDB `__, `MS SQL Server `__, `Oracle `__, `IBM Db2 `__, `Elasticsearch `__, `Snowflake `__, `Splunk `__, `DuckDB `__.
   * XML Schema regexes are compatible except that they are implicitly anchored: i.e., the beginning-of-input ``^`` and end-of-input ``$`` anchors must be removed, and missing anchors replaced by ``.*``.
   * SQLite supports regexes via libraries and thus can use a compatible format (e.g., PCRE2).
-  * Basic POSIX regular expressions requires grouping to be escaped, i.e. ``\(``, ``\)``.
-  * Rust regexes are compatible except they do not recognize ``\/`` for a literal ``/``, which would have to be translated into just a single ``/`` for a literal match.
+  * Basic POSIX regular expressions differ from the defined format in that it requires groupings to be escaped, i.e. ``\(``, ``\)``.

From c84566d5d8bee711a188ec993427dc59ba18ab78 Mon Sep 17 00:00:00 2001
From: Rickard Armiento
Date: Wed, 14 Feb 2024 17:33:15 +0100
Subject: [PATCH 11/28] Remove unnecessary constraint on class order.
---
 optimade.rst | 1 -
 1 file changed, 1 deletion(-)

diff --git a/optimade.rst b/optimade.rst
index fb51990a2..467c93e72 100644
--- a/optimade.rst
+++ b/optimade.rst
@@ -3989,7 +3989,6 @@ Furthermore, it can only use the following tokens and features (this list is par
 - A literal escape of one of the syntax characters, i.e., the escape character (``\``) followed by one of the following characters ``^ $ \ . * + ? ( ) [ ] { } |`` to represent that literal character.
 - Simple character classes (e.g., ``[abc]``) and range character classes (e.g., ``[a-z]``) with the following constraints:
 
-  * The class has to be ordered so that it does not start with the character ``[``.
   * If the first character is ``]`` it designates a class that includes a literal ``]`` (and not an empty class).
     The ``]`` character cannot appear anywhere else in the class.
   * The character ``-`` designates ranges unless it appears at the start or end of the class.

From fc411a4076f728c16e31b3bdebfc15be76918351 Mon Sep 17 00:00:00 2001
From: Rickard Armiento
Date: Wed, 14 Feb 2024 17:38:50 +0100
Subject: [PATCH 12/28] Fix minor grammar error

---
 optimade.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/optimade.rst b/optimade.rst
index 467c93e72..13231b1b5 100644
--- a/optimade.rst
+++ b/optimade.rst
@@ -4023,4 +4023,4 @@ Implementations that do not produce errors in this situation are RECOMMENDED to
   * The following regex formats appear to be compatible: `Perl `__, `POSIX ERE `__, `Python `__, `Ruby `__, `Rust `__, `Java `__, `.NET `__, `MySQL `__, `MongoDB `__, `MS SQL Server `__, `Oracle `__, `IBM Db2 `__, `Elasticsearch `__, `Snowflake `__, `Splunk `__, `DuckDB `__.
   * XML Schema regexes are compatible except that they are implicitly anchored: i.e., the beginning-of-input ``^`` and end-of-input ``$`` anchors must be removed, and missing anchors replaced by ``.*``.
   * SQLite supports regexes via libraries and thus can use a compatible format (e.g., PCRE2).
- * Basic POSIX regular expressions differ from the defined format in that it requires groupings to be escaped, i.e. ``\(``, ``\)``. + * Basic POSIX regular expressions differ from the defined format in that they require several of the metacharacters to be escaped, e.g. ``\(``, ``\)``. From 6d3e5a2c61c1bd396b9fd8cde2c9d31e172507e1 Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Thu, 15 Feb 2024 09:36:26 +0100 Subject: [PATCH 13/28] Capitalize Unicode Co-authored-by: Antanas Vaitkus --- optimade.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/optimade.rst b/optimade.rst index 13231b1b5..bbd53577f 100644 --- a/optimade.rst +++ b/optimade.rst @@ -3967,14 +3967,14 @@ An example of the sparse layout for multidimensional lists with three aggregated OPTIMADE Regular Expression Format ---------------------------------- -This section defines a unicode string representation of regular expressions (regexes) to be referenced from other parts of the specification. +This section defines a Unicode string representation of regular expressions (regexes) to be referenced from other parts of the specification. The format will be referred to as an "OPTIMADE regex". Regexes are commonly embedded in a contexts where they need to be enclosed by delimiters (e.g., double quotes or slash characters). If this is the case, it is likely that some outer-level escape rules apply to allow the end delimiter to appear within the regex. Such delimiters and escape rules are *not* included in the definition of the OPTIMADE regex format itself and needs to be clarified when this format is referenced. The format defined in this section applies after such outer escape rules have been applied (e.g., when all occurences of ``\/`` have been translated into ``/`` for a format where an unescaped slash character is the end delimiter). 
-Likewise, if an OPTIMADE regex is embedded in a serialized data format (e.g., JSON) this section documents the format of the unicode string resulting from unserialization of that format. +Likewise, if an OPTIMADE regex is embedded in a serialized data format (e.g., JSON) this section documents the format of the Unicode string resulting from unserialization of that format. The format is a subset of the format described in `ECMA-262, section 21.2.1 `__. The format is closely inspired by the subset recommended in the JSON Schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__. @@ -4003,7 +4003,7 @@ Furthermore, it can only use the following tokens and features (this list is par - Simple grouping (``(...)``) and alternation (``|``). Note that compared to the JSON Schema standard, lazy quantifiers (``+?``, ``*?``, ``??``) are NOT included, nor are range quantifiers (``{x}``, ``{x,y}``, ``{x,}``). -Furthermore, there is no support for character class shorthands via the backslash character ``\`` and a letter, nor is there a way to represent a unicode character by its code point (i.e., one has to include it as the literal unicode character). +Furthermore, there is no support for character class shorthands via the backslash character ``\`` and a letter, nor is there a way to represent a Unicode character by its code point (i.e., one has to include it as the literal Unicode character). An OPTIMADE regex matches the string at any position unless it contains a leading beginning-of-input (``^``) or trailing end-of-input (``$``) anchor listed above, i.e., the anchors are not implicitly assumed. For example, the OPTIMADE regex "es" matches "expression". 
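The unanchored-matching rule stated at the end of this hunk can be illustrated with a minimal Python sketch. The helper name ``optimade_regex_matches`` is hypothetical, and Python's ``re`` module is only an approximation of the ECMA-262 semantics (for example, Python's ``$`` also matches just before a trailing line feed, and ``.`` excludes only LINE FEED by default), so this should be read as an illustration of unanchored versus anchored matching rather than as a conforming implementation.

```python
import re


def optimade_regex_matches(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if `pattern` matches at any position in `text`.

    re.search scans the whole string, so no anchors are implicitly assumed;
    a pattern is only tied to the start or end of the input if it contains
    an explicit ^ or $ anchor.
    """
    return re.search(pattern, text) is not None


# The example from the text: "es" is found inside "expression".
print(optimade_regex_matches("es", "expression"))
# With an explicit beginning-of-input anchor the same text no longer matches.
print(optimade_regex_matches("^es", "expression"))
```

An engine whose default entry point is implicitly anchored (such as Python's ``re.fullmatch``) would give a different answer for the first call, which is why the choice of ``re.search`` matters here.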
From 3736ebd59b4874057f6966f3b23e4dd55461b154 Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Fri, 16 Feb 2024 02:39:16 +0100 Subject: [PATCH 14/28] Clean-up and made more stringent --- optimade.rst | 59 ++++++++++++++++++++++++++++++---------------------- 1 file changed, 34 insertions(+), 25 deletions(-) diff --git a/optimade.rst b/optimade.rst index bbd53577f..5079ea40d 100644 --- a/optimade.rst +++ b/optimade.rst @@ -3968,59 +3968,68 @@ An example of the sparse layout for multidimensional lists with three aggregated OPTIMADE Regular Expression Format ---------------------------------- This section defines a Unicode string representation of regular expressions (regexes) to be referenced from other parts of the specification. -The format will be referred to as an "OPTIMADE regex". +The format will be referred to as an "OPTIMADE regex." -Regexes are commonly embedded in a contexts where they need to be enclosed by delimiters (e.g., double quotes or slash characters). -If this is the case, it is likely that some outer-level escape rules apply to allow the end delimiter to appear within the regex. -Such delimiters and escape rules are *not* included in the definition of the OPTIMADE regex format itself and needs to be clarified when this format is referenced. -The format defined in this section applies after such outer escape rules have been applied (e.g., when all occurences of ``\/`` have been translated into ``/`` for a format where an unescaped slash character is the end delimiter). -Likewise, if an OPTIMADE regex is embedded in a serialized data format (e.g., JSON) this section documents the format of the Unicode string resulting from unserialization of that format. +Regexes are commonly embedded in a context where they need to be enclosed by delimiters (e.g., double quotes or slash characters). +If this is the case, some outer-level escape rules likely apply to allow the end delimiter to appear within the regex. 
+Such delimiters and escape rules are *not* included in the definition of the OPTIMADE regex format itself and need to be clarified when this format is referenced. +The format defined in this section applies after such outer escape rules have been applied (e.g., when all occurrences of ``\/`` have been translated into ``/`` for a format where an unescaped slash character is the end delimiter). +Likewise, if an OPTIMADE regex is embedded in a serialized data format (e.g., JSON), this section documents the format of the Unicode string resulting from the deserialization of that format. The format is a subset of the format described in `ECMA-262, section 21.2.1 `__. The format is closely inspired by the subset recommended in the JSON Schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__. However, OPTIMADE has decided to restrict the subset further to better align it with the features available in common database backends and to clarify the limitations of character classes and character escapes. -The intent is that the specified format also is a subset of the `PCRE2 regex format `__ to make the format directly useful (without translation) in a wide range of regex implementations. +The intent is that the specified format is also a subset of the `PCRE2 regex format `__ to make the format directly useful (without translation) in a wide range of regex implementations. Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__ with the additional restrictions described in the following. The regex is interpreted according to the processing rules that apply for an expression where only the Unicode variable is set to true of all variables set by the RegExp internal slot described by `ECMA-262, section 21.2.2.1 `__. 
Furthermore, it can only use the following tokens and features (this list is partially quoted from the JSON Schema standard): -- Individual Unicode characters, as defined by the `JSON specification `__. +- Individual Unicode characters matching themselves, as defined by the `JSON specification `__. +- The ``.`` character to match any one Unicode character except the line break characters LINE FEED (LF) (U+000A), CARRAGE RETURN (U+000D), LINE SEPARATOR (U+2028), PARAGRAPH SEPARATOR (U+2029) (see `ECMA-262 section 2.2.2.7 `__). - A literal escape of one of the syntax characters, i.e., the escape character (``\``) followed by one of the following characters ``^ $ \ . * + ? ( ) [ ] { } |`` to represent that literal character. + No other characters can be escaped. + (This rule prevents other escapes that are interpreted differently depending on regex flavor.) - Simple character classes (e.g., ``[abc]``) and range character classes (e.g., ``[a-z]``) with the following constraints: - * If the first character is ``]`` it designates a class that includes a literal ``]`` (and not an empty class). - The ``]`` character cannot appear anywhere else in the class. - * The character ``-`` designates ranges unless it appears at the start or end of the class. - * A literal ``\`` is represented by an escaped backslash ``\\``. + * The character ``-`` designates ranges unless it is the first or last character of the class. + * The characters ``\ [ ]`` can only appear escaped with a preceeding backslash, e.g. ``\\`` designates that the class includes a literal ``\`` character. + The other syntax characters may appear either escaped or unescaped to designate that the class includes them. + (This rule prevents other escapes inside classes that are not the same across regex flavors and expressions that, in some flavors, are interpreted as nested classes.) * Except for as specified above, all characters represent themselves literally (including syntax characters). 
* Each literal character can appear in the class at most once. + (This rule prevents expressions interpreted as pre-defined classes in some regex flavors, e.g., ``[:alpha:]``). - Complemented character classes (e.g., ``[^abc]``, ``[^a-z]``). -- Simple quantifiers: ``+`` (one or more), ``*`` (zero or more), ``?`` (zero or one). - (These can only appear directly after a character, group, or character class.) +- Simple quantifiers: ``+`` (one or more), ``*`` (zero or more), ``?`` (zero or one) that appear directly after a character, group, or character class. + (This rule prevents expressions with special meaning in some regex flavors, e.g., ``+?`` and ``(?`` ).) - The beginning-of-input (``^``) and end-of-input (``$``) anchors. - Simple grouping (``(...)``) and alternation (``|``). -Note that compared to the JSON Schema standard, lazy quantifiers (``+?``, ``*?``, ``??``) are NOT included, nor are range quantifiers (``{x}``, ``{x,y}``, ``{x,}``). -Furthermore, there is no support for character class shorthands via the backslash character ``\`` and a letter, nor is there a way to represent a Unicode character by its code point (i.e., one has to include it as the literal Unicode character). +Note that compared to the JSON Schema standard, lazy quantifiers (``+?``, ``*?``, ``??``) are *not* included, nor are range quantifiers (``{x}``, ``{x,y}``, ``{x,}``). +Furthermore, there is no support for escapes designating shorthand character classes as ``\`` and a letter or number, nor is there any way to represent a Unicode character by specifying a code point as a number, only via the Unicode character itself. +(However, the regex can be embedded in a context that defines such escapes, e.g., in serialized JSON a string containing the character ``\u`` followed by four hexadecimal digits is deserialized into the corresponding Unicode character.) 
An OPTIMADE regex matches the string at any position unless it contains a leading beginning-of-input (``^``) or trailing end-of-input (``$``) anchor listed above, i.e., the anchors are not implicitly assumed. For example, the OPTIMADE regex "es" matches "expression". -Regexes that utilizes tokes and features documented by ECMA-262 beyond the designated subset are allowed to have an undefined behavior,i.e., they MAY match or not match *any* string, or MAY produce an error. +Regexes that utilize tokes and features documented by ECMA-262 beyond the designated subset are allowed to have an undefined behavior, i.e., they MAY match or not match *any* string or MAY produce an error. Implementations that do not produce errors in this situation are RECOMMENDED to generate warnings if possible. Compatibility notes: - The specification tolerates (with undefined behavior) regexes using tokens and features beyond the defined subset. - Hence, a regex can be directly handed over to any internal regex engine that is compatible with the defined subset without need for validation or translation. - Compatibility with other regex formats may change between language versions and options. - As a general guide we have used third-party sources, e.g., the `Regular Expression Engine Comparison Chart `__ to collect the following information: + The definition tolerates (with undefined behavior) regexes that use tokens and features beyond the defined subset. + Hence, a regex can be directly handed over to a backend implementation compatible with the subset without needing validation or translation. + Additional consideration of how the ``.`` character operates in relation to line breaks may be required for multiline text. + If the regex is applied to strings containing only the LINE FEED (U+000A) character and none of the other Unicode line break characters, most regex backend implementations are compatible with the defined behavior. 
+ If the regex is applied to string data containing arbitrary combinations of line break characters and the right behavior cannot be achieved via environmental settings and regex options, implementations can consider a translation step where other line break characters are translated into LINE FEED in the text operated on. - * `ECMAScript `__ and `PCRE `__ (both v1 and v2) are meant to be compatible by design. + Compatibility with different regex implementations may change depending on the environment, language versions, and options and has to be verified by implementations. + However, as a general guide, we have used third-party sources, e.g., the `Regular Expression Engine Comparison Chart `__ to collect the following information for compatibility when operating on text using LINE FEED as the line break character: - * The following regex formats appear to be compatible: `Perl `__, `POSIX ERE `__, `Python `__, `Ruby `__, `Rust `__, `Java `__, `.NET `__, `MySQL `__, `MongoDB `__, `MS SQL Server `__, `Oracle `__, `IBM Db2 `__, `Elasticsearch `__, `Snowflake `__, `Splunk `__, `DuckDB `__. - * XML Schema regexes are compatible except that they are implicitly anchored: i.e., the beginning-of-input ``^`` and end-of-input ``$`` anchors must be removed, and missing anchors replaced by ``.*``. + * `ECMAScript (also known as javascript) `__ and version 1 and 2 of `PCRE `__ are meant to be compatible by design. + + * The following regex formats appear generally compatible when operating in Unicode mode: `Perl `__, `Python `__, `Ruby `__, `Rust `__, `Java `__, `.NET `__, `MySQL 8 `__, `MongoDB `__, `Oracle `__, `IBM Db2 `__, `Elasticsearch `__, `DuckDB `__ (which uses the `re2 `__ library). * SQLite supports regexes via libraries and thus can use a compatible format (e.g., PCRE2). - * Basic POSIX regular expressions differ from the defined format in that they require several of the metacharacters to be escaped, e.g. ``\(``, ``\)``. 
+ * XML Schema appears to use a compatible regex format, except it is implicitly anchored: i.e., the beginning-of-input ``^`` and end-of-input ``$`` anchors must be removed, and missing anchors replaced by ``.*``. + * POSIX Extended regexes (and their extended GNU implementations) are incompatible because ``\`` is not a special character in character classes. POSIX Basic regexes also have further differences, e.g., the meaning of some escaped syntax characters is reversed. From 864747cba70248676a53d4b49d9506bd00ec4ae4 Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Fri, 16 Feb 2024 02:54:40 +0100 Subject: [PATCH 15/28] Re-revert mistankenly reverted logical quoting style. --- optimade.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/optimade.rst b/optimade.rst index 5079ea40d..183d81b0d 100644 --- a/optimade.rst +++ b/optimade.rst @@ -3968,7 +3968,7 @@ An example of the sparse layout for multidimensional lists with three aggregated OPTIMADE Regular Expression Format ---------------------------------- This section defines a Unicode string representation of regular expressions (regexes) to be referenced from other parts of the specification. -The format will be referred to as an "OPTIMADE regex." +The format will be referred to as an "OPTIMADE regex". Regexes are commonly embedded in a context where they need to be enclosed by delimiters (e.g., double quotes or slash characters). If this is the case, some outer-level escape rules likely apply to allow the end delimiter to appear within the regex. 
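The compatibility notes introduced in PATCH 14/28 suggest that, when a backend cannot be configured to treat all Unicode line break characters the way ECMA-262 does, the text operated on can be translated so that only LINE FEED remains. A minimal sketch of such a translation step in Python follows; the function name ``normalize_line_breaks`` is an assumption made for illustration and is not part of the specification.

```python
import re

# Map the other line break characters named in the text to LINE FEED:
# CARRIAGE RETURN (U+000D), LINE SEPARATOR (U+2028), PARAGRAPH SEPARATOR (U+2029).
_LINE_BREAK_TABLE = str.maketrans({"\r": "\n", "\u2028": "\n", "\u2029": "\n"})


def normalize_line_breaks(text: str) -> str:
    """Hypothetical helper: replace each non-LF line break character by LF.

    The replacement is one-to-one, so a CR LF pair becomes two LINE FEED
    characters and string length is preserved.
    """
    return text.translate(_LINE_BREAK_TABLE)


# By default, Python's "." matches CARRIAGE RETURN, unlike the defined behavior.
print(re.search("a.b", "a\rb") is not None)
# After normalization the CR has become LF, which "." does not match.
print(re.search("a.b", normalize_line_breaks("a\rb")) is not None)
```

Whether a one-to-one replacement or a collapse of CR LF into a single LINE FEED is appropriate depends on the data; the one-to-one form is used here because the defined ``.`` behavior treats each line break character individually.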
From c20ad035b417d2c3cf7eb37644fb51622bdf2232 Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Fri, 16 Feb 2024 03:05:30 +0100 Subject: [PATCH 16/28] Minor formulation adjustments, correcting rst linebreaks --- optimade.rst | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/optimade.rst b/optimade.rst index 183d81b0d..e88001ba9 100644 --- a/optimade.rst +++ b/optimade.rst @@ -4013,7 +4013,7 @@ Furthermore, there is no support for escapes designating shorthand character cla An OPTIMADE regex matches the string at any position unless it contains a leading beginning-of-input (``^``) or trailing end-of-input (``$``) anchor listed above, i.e., the anchors are not implicitly assumed. For example, the OPTIMADE regex "es" matches "expression". -Regexes that utilize tokes and features documented by ECMA-262 beyond the designated subset are allowed to have an undefined behavior, i.e., they MAY match or not match *any* string or MAY produce an error. +Regexes that utilize tokes and features beyond the designated subset are allowed to have an undefined behavior, i.e., they MAY match or not match *any* string or MAY produce an error. Implementations that do not produce errors in this situation are RECOMMENDED to generate warnings if possible. Compatibility notes: @@ -4032,4 +4032,5 @@ Implementations that do not produce errors in this situation are RECOMMENDED to * The following regex formats appear generally compatible when operating in Unicode mode: `Perl `__, `Python `__, `Ruby `__, `Rust `__, `Java `__, `.NET `__, `MySQL 8 `__, `MongoDB `__, `Oracle `__, `IBM Db2 `__, `Elasticsearch `__, `DuckDB `__ (which uses the `re2 `__ library). * SQLite supports regexes via libraries and thus can use a compatible format (e.g., PCRE2). * XML Schema appears to use a compatible regex format, except it is implicitly anchored: i.e., the beginning-of-input ``^`` and end-of-input ``$`` anchors must be removed, and missing anchors replaced by ``.*``. 
- * POSIX Extended regexes (and their extended GNU implementations) are incompatible because ``\`` is not a special character in character classes. POSIX Basic regexes also have further differences, e.g., the meaning of some escaped syntax characters is reversed. + * POSIX Extended regexes (and their extended GNU implementations) are incompatible because ``\`` is not a special character in character classes. + POSIX Basic regexes also have further differences, e.g., the meaning of some escaped syntax characters is reversed. From fb2e2d363fd41eef89e0a0e15ae826f2f807b5e7 Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Fri, 16 Feb 2024 03:20:00 +0100 Subject: [PATCH 17/28] Remove trailing whitespace --- optimade.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/optimade.rst b/optimade.rst index e88001ba9..a35104c6e 100644 --- a/optimade.rst +++ b/optimade.rst @@ -4032,5 +4032,5 @@ Implementations that do not produce errors in this situation are RECOMMENDED to * The following regex formats appear generally compatible when operating in Unicode mode: `Perl `__, `Python `__, `Ruby `__, `Rust `__, `Java `__, `.NET `__, `MySQL 8 `__, `MongoDB `__, `Oracle `__, `IBM Db2 `__, `Elasticsearch `__, `DuckDB `__ (which uses the `re2 `__ library). * SQLite supports regexes via libraries and thus can use a compatible format (e.g., PCRE2). * XML Schema appears to use a compatible regex format, except it is implicitly anchored: i.e., the beginning-of-input ``^`` and end-of-input ``$`` anchors must be removed, and missing anchors replaced by ``.*``. - * POSIX Extended regexes (and their extended GNU implementations) are incompatible because ``\`` is not a special character in character classes. + * POSIX Extended regexes (and their extended GNU implementations) are incompatible because ``\`` is not a special character in character classes. POSIX Basic regexes also have further differences, e.g., the meaning of some escaped syntax characters is reversed. 
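The XML Schema note in the compatibility list (remove ``^`` and ``$``, replace missing anchors by ``.*``) could be sketched as below. This is a deliberately conservative illustration under stated assumptions: the function name ``to_implicitly_anchored`` is hypothetical, only a single leading ``^`` and a single trailing unescaped ``$`` are recognized, anchors appearing elsewhere in the expression are not handled, and patterns that combine an anchor with ``|`` are rejected rather than translated, since stripping the anchor would change which alternative it applies to.

```python
def to_implicitly_anchored(pattern: str) -> str:
    """Hypothetical sketch: rewrite an unanchored-style regex for an engine
    that implicitly anchors at both ends (as XML Schema regexes do)."""
    starts = pattern.startswith("^")

    # A trailing "$" is an anchor only if it is not escaped, i.e., if it is
    # preceded by an even number of backslashes.
    ends = False
    if pattern.endswith("$"):
        stem = pattern[:-1]
        backslashes = len(stem) - len(stem.rstrip("\\"))
        ends = backslashes % 2 == 0

    body = pattern[(1 if starts else 0): len(pattern) - (1 if ends else 0)]

    if "|" in body:
        # Conservative check: this also triggers on an escaped or in-class "|".
        if starts or ends:
            raise NotImplementedError("anchors combined with '|' are not handled")
        # Group the body so that the added ".*" applies to every alternative.
        body = "(" + body + ")"

    return ("" if starts else ".*") + body + ("" if ends else ".*")


print(to_implicitly_anchored("es"))
print(to_implicitly_anchored("^ex"))
print(to_implicitly_anchored("on$"))
```

The translated expression is meant to be evaluated with whole-string semantics; in Python that corresponds to ``re.fullmatch`` rather than ``re.search``.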
From dcffe75777e8fb0425b45eb70b1f1ce8b41b233e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Saulius=20Gra=C5=BEulis?= Date: Thu, 21 Mar 2024 15:53:36 +0200 Subject: [PATCH 18/28] =?UTF-8?q?"tokes"=20=E2=86=92=20"tokens"=20=3F?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- optimade.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/optimade.rst b/optimade.rst index a35104c6e..88a07b6d2 100644 --- a/optimade.rst +++ b/optimade.rst @@ -4013,7 +4013,7 @@ Furthermore, there is no support for escapes designating shorthand character cla An OPTIMADE regex matches the string at any position unless it contains a leading beginning-of-input (``^``) or trailing end-of-input (``$``) anchor listed above, i.e., the anchors are not implicitly assumed. For example, the OPTIMADE regex "es" matches "expression". -Regexes that utilize tokes and features beyond the designated subset are allowed to have an undefined behavior, i.e., they MAY match or not match *any* string or MAY produce an error. +Regexes that utilize tokens and features beyond the designated subset are allowed to have an undefined behavior, i.e., they MAY match or not match *any* string or MAY produce an error. Implementations that do not produce errors in this situation are RECOMMENDED to generate warnings if possible. 
Compatibility notes: From 610a7c1d44199e3c2f672abbd31049099d920885 Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Fri, 22 Mar 2024 01:09:17 +0100 Subject: [PATCH 19/28] Apply suggestions from review --- optimade.rst | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/optimade.rst b/optimade.rst index 8b46300d6..c2c9cd60d 100644 --- a/optimade.rst +++ b/optimade.rst @@ -4015,9 +4015,10 @@ The format defined in this section applies after such outer escape rules have be Likewise, if an OPTIMADE regex is embedded in a serialized data format (e.g., JSON), this section documents the format of the Unicode string resulting from the deserialization of that format. The format is a subset of the format described in `ECMA-262, section 21.2.1 `__. -The format is closely inspired by the subset recommended in the JSON Schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__. -However, OPTIMADE has decided to restrict the subset further to better align it with the features available in common database backends and to clarify the limitations of character classes and character escapes. -The intent is that the specified format is also a subset of the `PCRE2 regex format `__ to make the format directly useful (without translation) in a wide range of regex implementations. +The restrictions are chosen to match features commonly available in database backends. +The subset is compatible with the `PCRE2 regex engine `__ when it is configured accordingly. +The subset is also intended to be compatible with, but even further restricted than, the subset recommended in the JSON Schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__. +The compatibility with the JSON Schema standard is expressed as "intended" since there is some room for interpretation of the precise features included in the subset of regex features recommended in that standard. 
Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__ with the additional restrictions described in the following. The regex is interpreted according to the processing rules that apply for an expression where only the Unicode variable is set to true of all variables set by the RegExp internal slot described by `ECMA-262, section 21.2.2.1 `__. @@ -4025,22 +4026,21 @@ Furthermore, it can only use the following tokens and features (this list is par - Individual Unicode characters matching themselves, as defined by the `JSON specification `__. - The ``.`` character to match any one Unicode character except the line break characters LINE FEED (LF) (U+000A), CARRAGE RETURN (U+000D), LINE SEPARATOR (U+2028), PARAGRAPH SEPARATOR (U+2029) (see `ECMA-262 section 2.2.2.7 `__). -- A literal escape of one of the syntax characters, i.e., the escape character (``\``) followed by one of the following characters ``^ $ \ . * + ? ( ) [ ] { } |`` to represent that literal character. +- A literal escape of `one of the characters defined as syntax characters in the ECMA-262 standard `__, i.e., the escape character (``\``) followed by one of the following characters ``^ $ \ . * + ? ( ) [ ] { } |`` to represent that literal character. No other characters can be escaped. (This rule prevents other escapes that are interpreted differently depending on regex flavor.) -- Simple character classes (e.g., ``[abc]``) and range character classes (e.g., ``[a-z]``) with the following constraints: +- Simple character classes (e.g., ``[abc]``), complemented character classes (e.g. [^abc]) and their ranged versions (e.g., ``[a-z]``, ``[^a-z]``) with the following constraints: - * The character ``-`` designates ranges unless it is the first or last character of the class. - * The characters ``\ [ ]`` can only appear escaped with a preceeding backslash, e.g. ``\\`` designates that the class includes a literal ``\`` character. 
+ * The character ``-`` designates ranges, unless it is the first or last character of the class in which case it represents a literal ``-`` character. + * If the first character is ``^`` then the expression matches all characters *except* the ones specified by the class as defined by the characters that follows. + * The characters ``\ [ ]`` can only appear escaped with a preceding backslash, e.g. ``\\`` designates that the class includes a literal ``\`` character. The other syntax characters may appear either escaped or unescaped to designate that the class includes them. (This rule prevents other escapes inside classes that are not the same across regex flavors and expressions that, in some flavors, are interpreted as nested classes.) - * Except for as specified above, all characters represent themselves literally (including syntax characters). - * Each literal character can appear in the class at most once. - (This rule prevents expressions interpreted as pre-defined classes in some regex flavors, e.g., ``[:alpha:]``). - -- Complemented character classes (e.g., ``[^abc]``, ``[^a-z]``). + * Except as specified above, all characters represent themselves literally (including syntax characters). + * Characters that represent themselves literally can only appear at most once. + (This rule prevents various kinds of extended character class syntax that differs between regex formats that assigns special meaning to duplicated characters, e.g., POSIX character classes, e.g., ``[:alpha:]``, equivalence classes, e.g., ``[=a=]``, set constructs, e.g. ``[A--B]``, ``[A&&B]``, etc.). - Simple quantifiers: ``+`` (one or more), ``*`` (zero or more), ``?`` (zero or one) that appear directly after a character, group, or character class. - (This rule prevents expressions with special meaning in some regex flavors, e.g., ``+?`` and ``(?`` ).) + (This rule prevents expressions with special meaning in some regex flavors, e.g., ``+?`` and ``(?...)``.) 
- The beginning-of-input (``^``) and end-of-input (``$``) anchors. - Simple grouping (``(...)``) and alternation (``|``). @@ -4060,9 +4060,9 @@ Implementations that do not produce errors in this situation are RECOMMENDED to Hence, a regex can be directly handed over to a backend implementation compatible with the subset without needing validation or translation. Additional consideration of how the ``.`` character operates in relation to line breaks may be required for multiline text. If the regex is applied to strings containing only the LINE FEED (U+000A) character and none of the other Unicode line break characters, most regex backend implementations are compatible with the defined behavior. - If the regex is applied to string data containing arbitrary combinations of line break characters and the right behavior cannot be achieved via environmental settings and regex options, implementations can consider a translation step where other line break characters are translated into LINE FEED in the text operated on. + If the regex is applied to string data containing arbitrary combinations of Unicode line break characters and the right behavior cannot be achieved via environmental settings and regex options, implementations can consider a translation step where other line break characters are translated into LINE FEED in the text operated on. - Compatibility with different regex implementations may change depending on the environment, language versions, and options and has to be verified by implementations. + Compatibility with different regex implementations may change depending on the environment, implementation programming language versions, and options and has to be verified by implementations. 
However, as a general guide, we have used third-party sources, e.g., the `Regular Expression Engine Comparison Chart `__ to collect the following information for compatibility when operating on text using LINE FEED as the line break character: * `ECMAScript (also known as javascript) `__ and version 1 and 2 of `PCRE `__ are meant to be compatible by design. From 1315ad8d97bacd705d94431039fc8c181ccfa1cd Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Fri, 22 Mar 2024 01:41:19 +0100 Subject: [PATCH 20/28] Restructuring to remove repetitions and improve order --- optimade.rst | 41 ++++++++++++++++++++--------------------- 1 file changed, 20 insertions(+), 21 deletions(-) diff --git a/optimade.rst b/optimade.rst index c2c9cd60d..385bb6e94 100644 --- a/optimade.rst +++ b/optimade.rst @@ -4014,15 +4014,10 @@ Such delimiters and escape rules are *not* included in the definition of the OPT The format defined in this section applies after such outer escape rules have been applied (e.g., when all occurrences of ``\/`` have been translated into ``/`` for a format where an unescaped slash character is the end delimiter). Likewise, if an OPTIMADE regex is embedded in a serialized data format (e.g., JSON), this section documents the format of the Unicode string resulting from the deserialization of that format. -The format is a subset of the format described in `ECMA-262, section 21.2.1 `__. -The restrictions are chosen to match features commonly available in database backends. -The subset is compatible with the `PCRE2 regex engine `__ when it is configured accordingly. -The subset is also intended to be compatible with, but even further restricted than, the subset recommended in the JSON Schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__. 
-The compatibility with the JSON Schema standard is expressed as "intended" since there is some room for interpretation of the precise features included in the subset of regex features recommended in that standard. +An OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__ with additional restrictions described below which define a subset of the ECMA-262 format chosen to match features commonly available in different database backends. +The regex is interpreted according to the ECMA-262 processing rules that apply for an expression where only the Unicode variable is set to true of all variables set by the RegExp internal slot described by `ECMA-262, section 21.2.2.1 `__. -Hence, an OPTIMADE regex is a regular expression that adheres to `ECMA-262, section 21.2.1 `__ with the additional restrictions described in the following. -The regex is interpreted according to the processing rules that apply for an expression where only the Unicode variable is set to true of all variables set by the RegExp internal slot described by `ECMA-262, section 21.2.2.1 `__. -Furthermore, it can only use the following tokens and features (this list is partially quoted from the JSON Schema standard): +The subset includes only the following tokens and features: - Individual Unicode characters matching themselves, as defined by the `JSON specification `__. - The ``.`` character to match any one Unicode character except the line break characters LINE FEED (LF) (U+000A), CARRAGE RETURN (U+000D), LINE SEPARATOR (U+2028), PARAGRAPH SEPARATOR (U+2029) (see `ECMA-262 section 2.2.2.7 `__). @@ -4056,19 +4051,23 @@ Implementations that do not produce errors in this situation are RECOMMENDED to Compatibility notes: - The definition tolerates (with undefined behavior) regexes that use tokens and features beyond the defined subset. - Hence, a regex can be directly handed over to a backend implementation compatible with the subset without needing validation or translation. 
- Additional consideration of how the ``.`` character operates in relation to line breaks may be required for multiline text. - If the regex is applied to strings containing only the LINE FEED (U+000A) character and none of the other Unicode line break characters, most regex backend implementations are compatible with the defined behavior. - If the regex is applied to string data containing arbitrary combinations of Unicode line break characters and the right behavior cannot be achieved via environmental settings and regex options, implementations can consider a translation step where other line break characters are translated into LINE FEED in the text operated on. + * The subset is intended to be compatible with, but even further restricted than, the subset recommended in the JSON Schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__. + The compatibility with the JSON Schema standard is expressed here as "intended" since there is some room for interpretation of the precise features included in the recommendation given in that standard. - Compatibility with different regex implementations may change depending on the environment, implementation programming language versions, and options and has to be verified by implementations. - However, as a general guide, we have used third-party sources, e.g., the `Regular Expression Engine Comparison Chart `__ to collect the following information for compatibility when operating on text using LINE FEED as the line break character: + * The definition tolerates (with undefined behavior) regexes that use tokens and features beyond the defined subset. + Hence, a regex can be directly handed over to a backend implementation compatible with the subset without needing validation or translation. + + * Additional consideration of how the ``.`` character operates in relation to line breaks may be required for multiline text. 
+ If the regex is applied to strings containing only the LINE FEED (U+000A) character and none of the other Unicode line break characters, most regex backend implementations are compatible with the defined behavior. + If the regex is applied to string data containing arbitrary combinations of Unicode line break characters and the right behavior cannot be achieved via environmental settings and regex options, implementations can consider a translation step where other line break characters are translated into LINE FEED in the text operated on. - * `ECMAScript (also known as javascript) `__ and version 1 and 2 of `PCRE `__ are meant to be compatible by design. + * Compatibility with different regex implementations may change depending on the environment, implementation programming language versions, and options and has to be verified by implementations. + However, as a general guide, we have used third-party sources, e.g., the `Regular Expression Engine Comparison Chart `__ to collect the following information for compatibility when operating on text using LINE FEED as the line break character: - * The following regex formats appear generally compatible when operating in Unicode mode: `Perl `__, `Python `__, `Ruby `__, `Rust `__, `Java `__, `.NET `__, `MySQL 8 `__, `MongoDB `__, `Oracle `__, `IBM Db2 `__, `Elasticsearch `__, `DuckDB `__ (which uses the `re2 `__ library). - * SQLite supports regexes via libraries and thus can use a compatible format (e.g., PCRE2). - * XML Schema appears to use a compatible regex format, except it is implicitly anchored: i.e., the beginning-of-input ``^`` and end-of-input ``$`` anchors must be removed, and missing anchors replaced by ``.*``. - * POSIX Extended regexes (and their extended GNU implementations) are incompatible because ``\`` is not a special character in character classes. - POSIX Basic regexes also have further differences, e.g., the meaning of some escaped syntax characters is reversed. 
+ * `ECMAScript (also known as javascript) `__ and version 1 and 2 of `PCRE `__ are meant to be compatible by design when used with appropriate options. + + * The following regex formats appear generally compatible when operating in Unicode mode: `Perl `__, `Python `__, `Ruby `__, `Rust `__, `Java `__, `.NET `__, `MySQL 8 `__, `MongoDB `__, `Oracle `__, `IBM Db2 `__, `Elasticsearch `__, `DuckDB `__ (which uses the `re2 `__ library). + * SQLite supports regexes via libraries and thus can use a compatible format (e.g., PCRE2). + * XML Schema appears to use a compatible regex format, except it is implicitly anchored: i.e., the beginning-of-input ``^`` and end-of-input ``$`` anchors must be removed, and missing anchors replaced by ``.*``. + * POSIX Extended regexes (and their extended GNU implementations) are incompatible because ``\`` is not a special character in character classes. + POSIX Basic regexes also have further differences, e.g., the meaning of some escaped syntax characters is reversed. From 680d4e3d40806c3885f29a44d9ff9a105b91f25a Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Fri, 22 Mar 2024 01:43:21 +0100 Subject: [PATCH 21/28] Missing oxford comma --- optimade.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/optimade.rst b/optimade.rst index 385bb6e94..338bf7522 100644 --- a/optimade.rst +++ b/optimade.rst @@ -4024,7 +4024,7 @@ The subset includes only the following tokens and features: - A literal escape of `one of the characters defined as syntax characters in the ECMA-262 standard `__, i.e., the escape character (``\``) followed by one of the following characters ``^ $ \ . * + ? ( ) [ ] { } |`` to represent that literal character. No other characters can be escaped. (This rule prevents other escapes that are interpreted differently depending on regex flavor.) -- Simple character classes (e.g., ``[abc]``), complemented character classes (e.g. 
[^abc]) and their ranged versions (e.g., ``[a-z]``, ``[^a-z]``) with the following constraints: +- Simple character classes (e.g., ``[abc]``), complemented character classes (e.g. [^abc]), and their ranged versions (e.g., ``[a-z]``, ``[^a-z]``) with the following constraints: * The character ``-`` designates ranges, unless it is the first or last character of the class in which case it represents a literal ``-`` character. * If the first character is ``^`` then the expression matches all characters *except* the ones specified by the class as defined by the characters that follows. From ec8d7a72694f4fd3df269da1ebb783123beafdc1 Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Fri, 22 Mar 2024 01:44:59 +0100 Subject: [PATCH 22/28] Fix rst formatting issue --- optimade.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/optimade.rst b/optimade.rst index 338bf7522..d32aab7c6 100644 --- a/optimade.rst +++ b/optimade.rst @@ -4024,7 +4024,7 @@ The subset includes only the following tokens and features: - A literal escape of `one of the characters defined as syntax characters in the ECMA-262 standard `__, i.e., the escape character (``\``) followed by one of the following characters ``^ $ \ . * + ? ( ) [ ] { } |`` to represent that literal character. No other characters can be escaped. (This rule prevents other escapes that are interpreted differently depending on regex flavor.) -- Simple character classes (e.g., ``[abc]``), complemented character classes (e.g. [^abc]), and their ranged versions (e.g., ``[a-z]``, ``[^a-z]``) with the following constraints: +- Simple character classes (e.g., ``[abc]``), complemented character classes (e.g. ``[^abc]``), and their ranged versions (e.g., ``[a-z]``, ``[^a-z]``) with the following constraints: * The character ``-`` designates ranges, unless it is the first or last character of the class in which case it represents a literal ``-`` character. 
* If the first character is ``^`` then the expression matches all characters *except* the ones specified by the class as defined by the characters that follows. From 9bc0135d399af584e32bf846004146291cc029b6 Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Fri, 22 Mar 2024 01:51:01 +0100 Subject: [PATCH 23/28] Fix reference to JSON Schema standard before it is discussed --- optimade.rst | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/optimade.rst b/optimade.rst index d32aab7c6..bb2493a10 100644 --- a/optimade.rst +++ b/optimade.rst @@ -4039,7 +4039,7 @@ The subset includes only the following tokens and features: - The beginning-of-input (``^``) and end-of-input (``$``) anchors. - Simple grouping (``(...)``) and alternation (``|``). -Note that compared to the JSON Schema standard, lazy quantifiers (``+?``, ``*?``, ``??``) are *not* included, nor are range quantifiers (``{x}``, ``{x,y}``, ``{x,}``). +Note that lazy quantifiers (``+?``, ``*?``, ``??``) are *not* included, nor are range quantifiers (``{x}``, ``{x,y}``, ``{x,}``). Furthermore, there is no support for escapes designating shorthand character classes as ``\`` and a letter or number, nor is there any way to represent a Unicode character by specifying a code point as a number, only via the Unicode character itself. (However, the regex can be embedded in a context that defines such escapes, e.g., in serialized JSON a string containing the character ``\u`` followed by four hexadecimal digits is deserialized into the corresponding Unicode character.) @@ -4048,7 +4048,6 @@ For example, the OPTIMADE regex "es" matches "expression". Regexes that utilize tokens and features beyond the designated subset are allowed to have an undefined behavior, i.e., they MAY match or not match *any* string or MAY produce an error. Implementations that do not produce errors in this situation are RECOMMENDED to generate warnings if possible. 
- Compatibility notes: * The subset is intended to be compatible with, but even further restricted than, the subset recommended in the JSON Schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__. From bda92b6faf99d132c555b19910f64bfebd5b8ae1 Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Fri, 22 Mar 2024 01:51:36 +0100 Subject: [PATCH 24/28] Formatting fix --- optimade.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/optimade.rst b/optimade.rst index bb2493a10..5e304a54b 100644 --- a/optimade.rst +++ b/optimade.rst @@ -4048,6 +4048,7 @@ For example, the OPTIMADE regex "es" matches "expression". Regexes that utilize tokens and features beyond the designated subset are allowed to have an undefined behavior, i.e., they MAY match or not match *any* string or MAY produce an error. Implementations that do not produce errors in this situation are RECOMMENDED to generate warnings if possible. + Compatibility notes: * The subset is intended to be compatible with, but even further restricted than, the subset recommended in the JSON Schema standard, see `JSON Schema: A Media Type for Describing JSON Documents 2020-12, section 6.4 `__. From 95698f6c61d929ea6ee905f1fd164a7fbed54fdf Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Fri, 22 Mar 2024 01:56:35 +0100 Subject: [PATCH 25/28] Minor text improvement --- optimade.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/optimade.rst b/optimade.rst index 5e304a54b..608644007 100644 --- a/optimade.rst +++ b/optimade.rst @@ -4033,7 +4033,7 @@ The subset includes only the following tokens and features: (This rule prevents other escapes inside classes that are not the same across regex flavors and expressions that, in some flavors, are interpreted as nested classes.) * Except as specified above, all characters represent themselves literally (including syntax characters). * Characters that represent themselves literally can only appear at most once. 
- (This rule prevents various kinds of extended character class syntax that differs between regex formats that assigns special meaning to duplicated characters, e.g., POSIX character classes, e.g., ``[:alpha:]``, equivalence classes, e.g., ``[=a=]``, set constructs, e.g. ``[A--B]``, ``[A&&B]``, etc.). + (This rule prevents various kinds of extended character class syntax that differs between regex formats that assigns special meaning to duplicated characters such as POSIX character classes, e.g., ``[:alpha:]``, equivalence classes, e.g., ``[=a=]``, set constructs, e.g. ``[A--B]``, ``[A&&B]``, etc.). - Simple quantifiers: ``+`` (one or more), ``*`` (zero or more), ``?`` (zero or one) that appear directly after a character, group, or character class. (This rule prevents expressions with special meaning in some regex flavors, e.g., ``+?`` and ``(?...)``.) - The beginning-of-input (``^``) and end-of-input (``$``) anchors. From cba1227b45a7994c4432f2c797f1a77718c08c26 Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Fri, 22 Mar 2024 02:03:26 +0100 Subject: [PATCH 26/28] Delete trailing whitespace --- optimade.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/optimade.rst b/optimade.rst index 608644007..0047d42c5 100644 --- a/optimade.rst +++ b/optimade.rst @@ -4056,7 +4056,7 @@ Implementations that do not produce errors in this situation are RECOMMENDED to * The definition tolerates (with undefined behavior) regexes that use tokens and features beyond the defined subset. Hence, a regex can be directly handed over to a backend implementation compatible with the subset without needing validation or translation. - + * Additional consideration of how the ``.`` character operates in relation to line breaks may be required for multiline text. 
If the regex is applied to strings containing only the LINE FEED (U+000A) character and none of the other Unicode line break characters, most regex backend implementations are compatible with the defined behavior. If the regex is applied to string data containing arbitrary combinations of Unicode line break characters and the right behavior cannot be achieved via environmental settings and regex options, implementations can consider a translation step where other line break characters are translated into LINE FEED in the text operated on. From 188b2f6a2d3dc756cb83606f3a603d3e316e3332 Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Fri, 22 Mar 2024 09:22:56 +0100 Subject: [PATCH 27/28] Fix rendering issue in compatibility notes --- optimade.rst | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/optimade.rst b/optimade.rst index 0047d42c5..b06d36ec3 100644 --- a/optimade.rst +++ b/optimade.rst @@ -4064,10 +4064,10 @@ Implementations that do not produce errors in this situation are RECOMMENDED to * Compatibility with different regex implementations may change depending on the environment, implementation programming language versions, and options and has to be verified by implementations. However, as a general guide, we have used third-party sources, e.g., the `Regular Expression Engine Comparison Chart `__ to collect the following information for compatibility when operating on text using LINE FEED as the line break character: - * `ECMAScript (also known as javascript) `__ and version 1 and 2 of `PCRE `__ are meant to be compatible by design when used with appropriate options. + * `ECMAScript (also known as javascript) `__ and version 1 and 2 of `PCRE `__ are meant to be compatible by design when used with appropriate options. 
- * The following regex formats appear generally compatible when operating in Unicode mode: `Perl `__, `Python `__, `Ruby `__, `Rust `__, `Java `__, `.NET `__, `MySQL 8 `__, `MongoDB `__, `Oracle `__, `IBM Db2 `__, `Elasticsearch `__, `DuckDB `__ (which uses the `re2 `__ library). - * SQLite supports regexes via libraries and thus can use a compatible format (e.g., PCRE2). - * XML Schema appears to use a compatible regex format, except it is implicitly anchored: i.e., the beginning-of-input ``^`` and end-of-input ``$`` anchors must be removed, and missing anchors replaced by ``.*``. - * POSIX Extended regexes (and their extended GNU implementations) are incompatible because ``\`` is not a special character in character classes. - POSIX Basic regexes also have further differences, e.g., the meaning of some escaped syntax characters is reversed. + * The following regex formats appear generally compatible when operating in Unicode mode: `Perl `__, `Python `__, `Ruby `__, `Rust `__, `Java `__, `.NET `__, `MySQL 8 `__, `MongoDB `__, `Oracle `__, `IBM Db2 `__, `Elasticsearch `__, `DuckDB `__ (which uses the `re2 `__ library). + * SQLite supports regexes via libraries and thus can use a compatible format (e.g., PCRE2). + * XML Schema appears to use a compatible regex format, except it is implicitly anchored: i.e., the beginning-of-input ``^`` and end-of-input ``$`` anchors must be removed, and missing anchors replaced by ``.*``. + * POSIX Extended regexes (and their extended GNU implementations) are incompatible because ``\`` is not a special character in character classes. + POSIX Basic regexes also have further differences, e.g., the meaning of some escaped syntax characters is reversed. 
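The compatibility notes in the patch above mention a translation step in which other line break characters are translated into LINE FEED in the text operated on. A minimal sketch of such a step in Python follows. The helper name ``normalize_line_breaks`` and the character-by-character mapping are assumptions made for illustration, not part of the specification. In particular, a CARRIAGE RETURN followed by LINE FEED becomes two LINE FEED characters under this mapping, and a regex containing a literal CARRIAGE RETURN would no longer match the translated text.

```python
import re

# Hypothetical mapping (an assumption for this sketch): every line break
# character named in the patch other than LINE FEED is mapped to LINE FEED.
_LINE_BREAKS_TO_LF = str.maketrans({
    "\r": "\n",      # CARRIAGE RETURN (U+000D)
    "\u2028": "\n",  # LINE SEPARATOR (U+2028)
    "\u2029": "\n",  # PARAGRAPH SEPARATOR (U+2029)
})


def normalize_line_breaks(text: str) -> str:
    """Return text with CR, LS, and PS replaced one-for-one by LF."""
    return text.translate(_LINE_BREAKS_TO_LF)


# Python's ``.`` excludes only LINE FEED by default, so it matches
# LINE SEPARATOR in the raw text but not in the translated text.
raw = "a\u2028b"
matches_raw = re.search("a.b", raw) is not None
matches_normalized = re.search("a.b", normalize_line_breaks(raw)) is not None
```

With Python's ``re`` module in its default mode, ``matches_raw`` is true and ``matches_normalized`` is false, which is the behavior the patch defines for ``.``. Whether a comparable step is needed for another backend depends on that backend's options and has to be verified there.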
From 24b811b4168dad6badca631cccd8e5e913b2c528 Mon Sep 17 00:00:00 2001 From: Rickard Armiento Date: Fri, 22 Mar 2024 09:46:24 +0100 Subject: [PATCH 28/28] Fix unicode character reference Co-authored-by: Matthew Evans <7916000+ml-evs@users.noreply.github.com> --- optimade.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/optimade.rst b/optimade.rst index b06d36ec3..f9fc765b9 100644 --- a/optimade.rst +++ b/optimade.rst @@ -4019,7 +4019,7 @@ The regex is interpreted according to the ECMA-262 processing rules that apply f The subset includes only the following tokens and features: -- Individual Unicode characters matching themselves, as defined by the `JSON specification `__. +- Individual Unicode characters matching themselves, as defined by the JSON specification (:RFC:`8259`). - The ``.`` character to match any one Unicode character except the line break characters LINE FEED (LF) (U+000A), CARRIAGE RETURN (U+000D), LINE SEPARATOR (U+2028), PARAGRAPH SEPARATOR (U+2029) (see `ECMA-262 section 2.2.2.7 `__). - A literal escape of `one of the characters defined as syntax characters in the ECMA-262 standard `__, i.e., the escape character (``\``) followed by one of the following characters ``^ $ \ . * + ? ( ) [ ] { } |`` to represent that literal character. No other characters can be escaped.
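The literal-escape rule quoted in the hunk above (the escape character followed by one of the listed syntax characters, and no other escapes) can be checked with a short scan. The following Python sketch is illustrative only: the function name ``find_disallowed_escapes`` is hypothetical, and the scan applies the same rule at every position, so it does not account for the separate constraints that the specification places on escapes inside character classes.

```python
# The syntax characters listed in the patch: ^ $ \ . * + ? ( ) [ ] { } |
SYNTAX_CHARACTERS = frozenset("^$\\.*+?()[]{}|")


def find_disallowed_escapes(regex: str) -> list[tuple[int, str]]:
    """Return (index, escape) pairs for escapes outside the allowed set.

    Hypothetical helper: it checks only the literal-escape rule and
    ignores every other restriction of the OPTIMADE regex subset.
    """
    problems = []
    i = 0
    while i < len(regex):
        if regex[i] == "\\":
            # Slice instead of index so that a trailing backslash
            # yields an empty string rather than an IndexError.
            escaped = regex[i + 1 : i + 2]
            if escaped not in SYNTAX_CHARACTERS:
                problems.append((i, "\\" + escaped))
            i += 2  # skip the escape character and the escaped character
        else:
            i += 1
    return problems
```

For instance, ``find_disallowed_escapes(r"\d+")`` reports the shorthand class ``\d`` at index 0, while ``r"a\.b"`` yields an empty list. Since regexes outside the subset are allowed to have undefined behavior, an implementation could use a check of this kind to generate the warnings the specification recommends, although nothing in the patch requires it.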