Checking for substrings of the node values #75

nkutsche · 2024-06-12T09:03:28Z

Years ago I developed a Schematron extension for my Escali implementation to make it easier to check sub-phrases inside text nodes. During XML Prague 2024 this topic came up again in discussions, and I would like to follow up on that discussion here:

Currently you can only check nodes in Schematron. If you want to avoid for instance the phrase "foo" in any text node you would do the following:

<sch:rule context="text()">
    <sch:report test="matches(., 'foo')">Don't use 'foo'</sch:report>
</sch:rule>

This has the following problems:

- If you have multiple occurrences of "foo" in a text node, you always get only one validation error for this node.
- If you have a long text node, it may be very hard to find the exact phrase that causes the error, especially if the searched phrase is a single character and/or something well hidden.
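
To make the two problems concrete outside Schematron, here is a minimal Python sketch. It is not part of the proposal; the sample text and the function names are illustrative assumptions. A boolean test in the style of matches() collapses all occurrences into a single answer, while iterating over the matches yields one hit per occurrence together with its position.

```python
import re


def boolean_check(text: str, pattern: str) -> bool:
    """Analogue of matches(., 'foo'): one yes/no answer per text node."""
    return re.search(pattern, text) is not None


def per_match_check(text: str, pattern: str) -> list[tuple[int, int, str]]:
    """One (start, length, substring) triple per occurrence; start is 0-based."""
    return [(m.start(), len(m.group()), m.group())
            for m in re.finditer(pattern, text)]


sample = "A foo here, another foo there."  # hypothetical text node content
print(boolean_check(sample, "foo"))    # a single boolean, however many hits
print(per_match_check(sample, "foo"))  # one entry per hit, with its position
```

The second function returns exactly the kind of information (start and length) that a UI would need to underline each occurrence.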

The basic idea of the extension was to provide a regular expression in addition to the matched context. I would suggest something like this (reworking my initial proposal):

<sch:rule context="text()">
    <sch:report-phrase regex="foo" flags="i">Don't use '<sch:value-of select="."/>'</sch:report-phrase>
</sch:rule>

The following rules should be applied:

- The context node is atomized and analyzed with the regular expression given in @regex.
  - Any match will produce a potential validation error.
  - The flags in @flags are applied to the regex analysis, as known from several XPath functions (e.g. matches#3).
- A sch:report-phrase/@test attribute can optionally be added.
  - The value is an XPath expression; its context is the matched substring (as in xsl:analyze-string).
  - If the return value is true, the matched substring will produce a validation error.
  - The default value is true().
- The context inside sch:report-phrase is also the matched substring.

I don't think it is hard to implement, as the behavior is completely covered by xsl:analyze-string. The biggest challenge is to specify the location of the substring and integrate it into SVRL, isn't it? I did it in my implementation by adding character positions (start and length) to the location node path.

What do you think?

rjelliffe · 2024-06-12T15:35:06Z

(revised)
At the moment, I think Schematron supports the following (relying on analyze-string() to return an element tree):

<sch:rule context="xxx/text()">
   <!-- Assumes the fn prefix is bound to http://www.w3.org/2005/xpath-functions with sch:ns -->
   <sch:let name="analyzed-string"  value="analyze-string(., 'foo')"
                                    as="element(fn:analyze-string-result)?" />
   <sch:let name="first-matched"    value="$analyzed-string/fn:match[1]" />
   <sch:let name="second-matched"   value="$analyzed-string/fn:match[2]" />
   <sch:let name="third-matched"    value="$analyzed-string/fn:match[3]" />
   <sch:let name="tally-of-matched" value="count($analyzed-string/fn:match)" />

   <sch:report test="$first-matched">
       The naughty text 'foo' was found in the text "<sch:value-of select="
          concat($first-matched/preceding-sibling::*[1], $first-matched, $first-matched/following-sibling::*[1])" />"</sch:report>

   <sch:report test="$second-matched">
       The naughty text 'foo' was found in the text "<sch:value-of select="
          concat($second-matched/preceding-sibling::*[1], $second-matched, $second-matched/following-sibling::*[1])" />"</sch:report>

   <sch:report test="$third-matched">
       The naughty text 'foo' was found in the text "<sch:value-of select="
          concat($third-matched/preceding-sibling::*[1], $third-matched, $third-matched/following-sibling::*[1])" />"</sch:report>

   <sch:report test="$tally-of-matched &gt; 3">Multiple further 'foo' were found.</sch:report>
</sch:rule>

This gives what I think is the desired behaviour: multiple reports, and with better diagnostics. For the particular use-case that nkutsche provides, I suggest the approach above would be all that is needed: it is not hard to enumerate small numbers of rules, where there are small numbers of candidates.

BUT... if it could be shown that there was a use-case for long lists or complex data created by functions, then we do have a gap in Schematron.

(In this view, the problem is that there is no way to iterate assertions/reports over the elements in a variable containing a tree of elements, or to iterate over a list of things in a variable.)

One way to approach it would be to adopt something like proposal #63, to allow sch:pattern/@document to reference a document (or sequence of elements) in a variable.

<sch:let name="list-of-analysed-text"
     value="for $t in //xxx/text() return analyze-string($t, 'foo')"
     as="element()*" />
<sch:pattern document="$list-of-analysed-text" >
    <sch:rule context="/fn:analyze-string-result/fn:match">
         <sch:report test="true()">
       The naughty text 'foo' was found in the text "<sch:value-of select="
          concat(preceding-sibling::*[1], ., following-sibling::*[1])" />"</sch:report>
    </sch:rule>
</sch:pattern>

The advantage, perhaps, of this is that it gives the full expressive power of Schematron to the parsed text and it fits in with the notion of a "pattern" better.

Another approach would be to do something on sch:rule, such as an attribute @visit-each:

<sch:pattern>
    <sch:rule context="xxx/text()"  visit-each="analyze-string(., 'foo')/fn:match" >
         <sch:report test="true()">
       The naughty text 'foo' was found in the text "<sch:value-of select="
          concat(preceding-sibling::*[1], ., following-sibling::*[1])" />"</sch:report>
    </sch:rule>
</sch:pattern>

I think it is good for the standard to try to avoid mixing things related to particular QLBs (i.e. XSLT and the limitations of the XDM) and better to put in generic changes that can apply to all QLBs.

AndrewSales · 2024-06-12T16:27:55Z

Thanks, @nkutsche and @rjelliffe . FWIW, I like:

<sch:rule context="xxx/text()" visit-each="analyze-string(., 'foo')/fn:match">

I have a dim recollection that we also had something similar in the (proprietary) syntax of XMLProbe.

The processing logic would presumably be:

- @visit-each evaluates to a sequence of any length against the rule/@context;
- if the result is the empty sequence, the rule still fires;
- the @test expressions are then evaluated against every item in the sequence.

It would seem to make sense to pass @visit-each through to the SVRL output too.

rjelliffe · 2024-06-13T08:08:59Z

Ideally, the @subject should be an XPath that corresponds to the subrange matched too.

The fn:match and fn:non-match elements do not carry indexes into the original string.

But let's suppose @visit-each gets constrained to only include things specified in the QLB. So for XSLT 2 and 3 it might be the function analyze-string() only. (Are there any other functions in XPath 2 and 3 that return elements?)

If this was the constraint, then an implementation could certainly count the characters in the fn:analyze-string-result/*/text() to find the index into the original text() or whatever, and put out an XPath with substring() as the @subject.
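
As a rough illustration of that counting step, the following Python sketch splits a string into alternating match and non-match segments (loosely mirroring the shape of the analyze-string() result) and recovers each match's start by summing the lengths of the preceding segments. The function names and the node path are hypothetical, and Python's regex dialect stands in for XPath's.

```python
import re
from itertools import accumulate


def analyze_string(text, pattern):
    """Split text into ('match' | 'non-match', substring) segments in document order."""
    segments, pos = [], 0
    for m in re.finditer(pattern, text):
        if m.start() == m.end():
            continue  # ignore zero-length matches
        if m.start() > pos:
            segments.append(("non-match", text[pos:m.start()]))
        segments.append(("match", m.group()))
        pos = m.end()
    if pos < len(text):
        segments.append(("non-match", text[pos:]))
    return segments


def match_subjects(segments, node_path):
    """Build one XPath substring() expression (1-based start) per match segment."""
    lengths = [len(s) for _, s in segments]
    starts = [0, *accumulate(lengths)][:-1]  # characters preceding each segment
    return [f"substring({node_path}, {start + 1}, {len(s)})"
            for (kind, s), start in zip(segments, starts)
            if kind == "match"]


segments = analyze_string("a foo and foo", "foo")
print(match_subjects(segments, "/doc/p/text()[1]"))  # hypothetical node path
```

Because the segments partition the original string, the running sum of their lengths is enough to locate every match without the result tree carrying any explicit index.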

rjelliffe · 2024-06-13T10:56:52Z

Is it that the general case here is what we used to call "embedded notations"? I.e. where there is some language embedded.

For example, CSS, JavaScript and so on. JSON embedded in XML. Or SQL etc. ... or XPath.

For example, imagine a Schematron schema for XSLT that allowed us to apply house-style rules to the XPaths?

  <sch:rule context="xslt:template/@match"  view-as="parse-xpath-to-xml(.)/Im-a-parsed-xpath-top-element" >
      <sch:report test="/Im-a-parsed-xpath-top-element//function[@value='meep-meep']">
               According to our rules, we do not allow the meep-meep() function.
     </sch:report>
  </sch:rule>

This is, of course, one of the ways that syntax checkers like Java FindBugs or PMD work. They create an XML parse tree, then use some XPath system to express rules.

dmj · 2024-06-13T15:43:49Z

+1 for @visit-each.

I would not restrict the value to a particular function, but instead a) make its specification part of the query language binding and b) define the respective XPath version for XSLT and XPath.

Running the assertion over a transformed/parsed context would also make the validation of structured data in attributes ("embedded notations") easier to express.

rjelliffe · 2024-06-13T17:29:36Z

And this might strengthen the case for sch:rule-set: so you can make multiple assertions on the same synthesized data.

nkutsche · 2024-06-14T08:15:33Z

This discussion is going in a direction I did not expect, and I'm not sure I understand everything.

But my impression is that your focus is on my first issue:

This has the following problems:

If you have multiple occurrences of "foo" in a text node, you always get only one validation error for this node.

At least as important is the second issue:

If you have a long text node, it may be very hard to find the exact phrase that causes the error, especially if the searched phrase is a single character and/or something well hidden.

Let me demonstrate this with an example.

I implemented @rjelliffe's approach for a regex match with pure Schematron. This is how it looks in the Oxygen UI:

grafik.png (screenshot) <https://github.com/Schematron/schematron-enhancement-proposals/assets/9881428/7d622659-83b1-4f1a-975f-8fcc7b9ca1b3>

Now compare it with the same regex match using the Escali extension:

grafik.png (screenshot) <https://github.com/Schematron/schematron-enhancement-proposals/assets/9881428/8fa045c6-7bed-4d39-837e-24931eb1738a>

How much time do you think you would save by not having to search for the exact location of the bad terms, even in these small paragraphs? Now assume you have much longer documents with many checks for much better-hidden patterns (like abbreviations)!

If I understand @AndrewSales correctly, @visit-each would allow a free XPath expression. But that would not make it possible for a UI to provide the exact position of the matched patterns, would it?

rjelliffe · 2024-06-14T15:30:46Z

(Edited)

If I can generalize Nico's issue, the question for the @visit-each approach is how each item in the transformed data carries information relating it to the original context, suitable for SVRL (or other outputs) to pinpoint the location exactly. It would be a backwards step for Schematron to lose track of the positions in the XML, after all. But the XDM does not support "slices" (substrings that are indexes into an existing string).

I think the answer is that the QLB (e.g., for XSLT 3) specifies the standard transformation functions it supports, and these must have XML output with some index information. *

So the implementation parses the @view-as value for QLB-specified keywords such as 'analyze-string' and tweaks their output accordingly.

In the case of analyze-string(), this would involve each fn:analyze-string-result/* element being decorated with an attribute @sch:start. This would contain string-length(string-join(preceding-sibling::*)).

What about functions that don't return XML elements? In the case of, say, tokenize(), which goes from text() to xs:string*, the implementation would wrap each token in some element that includes the start index, e.g.

````
<sch:rule context="fred" view-as="tokenize( '(a*b*c*)*' )">
````

where element fred has the value 'abcabacb', would iterate over, say,

````
<string sch:start="1">abc</string>
<string sch:start="4">ab</string>
<string sch:start="6">ac</string>
<string sch:start="8">b</string>
````

So I expect that @view-as needs to be split to expose the XML to be tweaked, e.g.:

````
<sch:rule context="dog" view-as="licence/analyze-string( ...) " each="/fn:analyze-string-result/fn:match" />
````

or

````
<sch:rule context="dog">
    <sch:view-as with="licence/analyze-string( ...) " select="/fn:analyze-string-result/fn:match" />
````

So how would you then transfer the index to the outside SVRL etc.? You would use properties, which are the generic way to decorate the SVRL (or whatever) with dynamic information intended for use by subsequent processes rather than directly by humans:

````
<sch:report test="contains(., 'kitty')" properties="string-position">A dog licence should not have the string 'kitty'</sch:report>
...
<sch:property id="string-position">
    <string-start><sch:value-of select="@sch:start"/></string-start>
    <string-end><sch:value-of select="@sch:start + string-length(.)"/></string-end>
</sch:property>
````

In other words, to register a function for @view-as, the QLB needs to specify:
- its name that will be recognized
- how to tweak the function's output so that it is XML with appropriate index markup (e.g., sch:string)
- plus probably some standardized property for supporting slices, to transfer the range indexes into the SVRL.

(I am not sure, but did I hear that XPath 4 might include slices or text ranges? @view-as might find that useful.)

Rick

* @andrew: presumably, ISO Schematron 2005 standard would have a new annex specifying how some standard set of XSLT functions that go from text() to element()* are handled and decorated. Then each QLB that was interested would reference that. I think the QLB mechanism is appropriate, as it is intended to ensure that an implementation has everything needed to interpret the schema.
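
The tokenizing example above can be sketched in Python as follows. This is only an illustration under assumptions: the function names are hypothetical, the per-token pattern `a*b*c*` is used instead of the outer `(a*b*c*)*` (which would match the whole value as one token in Python's `re`), and the start index is 1-based as in the example.

```python
import re
from xml.sax.saxutils import escape


def tokens_with_start(value, token_pattern):
    """Return (1-based start, token) pairs for every non-empty match."""
    return [(m.start() + 1, m.group())
            for m in re.finditer(token_pattern, value)
            if m.group()]  # drop the zero-length match at the end of the value


def as_elements(tokens):
    """Wrap each token in a <string> element carrying its start index."""
    return [f'<string sch:start="{start}">{escape(token)}</string>'
            for start, token in tokens]


tokens = tokens_with_start("abcabacb", r"a*b*c*")
for element in as_elements(tokens):
    print(element)
```

A property such as the string-position one could then read the start attribute and the token length from each wrapper element.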

rjelliffe · 2024-06-14T15:38:41Z

Nico wrote:

Now compare it with the same regex match just with the Escali extension

This is not a fair comparison. If you wrote a custom extension for the regex approach you could get specific underlines too.

rjelliffe · 2024-06-15T04:54:27Z

I have updated my previous answer in the wiki. I meant to use string-length(), not count(), to get the start position in the text.

Rick

rjelliffe · 2024-06-16T04:44:23Z

I have re-worked my view-as proposal into #76.

This expresses it with more issues addressed, and I didn't want to hijack this thread, so that Nico's proposal can be discussed on its own merits.

AndrewSales · 2024-06-17T12:28:19Z

This thread has developed a lot since I last commented.

@nkutsche , I understand the issue and what you want to achieve.

I favour the additional attribute @visit-each allowing anything that evaluates to a sequence to keep things as generic as possible.

The specific, and likely most common, use case of matching substrings hinges on reporting the locations of those substrings. XPointer offered something here, but I'm not sure there was much uptake.

My feeling is the standard should incorporate the new attribute, make its intended purpose clear, and suggest (non-normatively) a possible method of passing sufficient substring information through to SVRL. For example, where analyze-string() is used, the returned fn:analyze-string-result will sit quite happily in SVRL and provide what an implementation or IDE extension needs to give more information to the user.

It might be clearest to write such rules as:

<rule context='para' visit-each='analyze-string(., "foo")'>
  <report test='fn:match'>[...]

so that an implementation can report the result of visit-each straightforwardly in SVRL.

I don't think the standard should attempt to define methods of capturing, expressing, obtaining or reporting substrings beyond this, as in my view that would tend towards a Schematron-specific subsetting of XPath and the XDM. Implementations are of course free to do what they need to internally.

(@nkutsche - just a thought: what if an SQF micro-transform added the match markup to the string? Then your headline report tells you "trouble in this para", when you get there, user has SQF mark up the problems, preferably in an sqf:* element invalid in the host doc.)

dmj · 2024-09-30T12:26:16Z

Wouldn't visitEach be a better name? As far as I can see, all other ISO Schematron attributes use camel case.

dmj · 2024-09-30T12:39:49Z

If @visit-each sets the context for the assertion tests, does it also set the context for sch:let variables?

rjelliffe · 2024-10-01T03:35:50Z

If the minimal approach is desired, I can see five approaches. I like 1, and 2 is OK.

*1. Different Context*

```
<sch:rule context="fred" viewAs="my:tokenize( '(a*b*c*)*' )">
    <sch:let name="freds-parent" value="./parent::*" as="element()" />
    <sch:assert test="$freds-parent/name()='Mavis' and contains(., 'c')">
        Fred's parent must be Mavis, and each abc token must contain at least one 'c'.
    </sch:assert>
</sch:rule>
```

I.e. the context for a variable is the ***@***.***, but the context for an assertion (sch:assert or sch:report) is the token. (I.e. if a report or assert needs to get access to the ***@***.***, it needs to do it through a variable.)

*Example: CSV*

Let's validate embedded CSV cells. Let's say the user defines a function my:parseCsv() that parses newlines and "|" into rows and cells. And let's say that ***@***.*** allows @select to select which data to iterate over. So our document is:

```
<data>abc|123|456
def|123|456</data>
```

Which might parse into:

```
<csv>
    <row n="1">
        <c n="1">abc</c>
        <c n="2">123</c>
        <c n="3">456</c>
    </row>
    <row n="2">
        <c n="1">def</c>
        <c n="2">123</c>
        <c n="3">456</c>
    </row>
</csv>
```

Which can be validated by:

```
<sch:rule context="data" viewAs="my:parseCsv(.)" select="/csv/row">
    <sch:assert test="self::row" role="unit-test">Only validates rows</sch:assert>
    <sch:assert test="number(c[2]) and c[2]!=0">The second cell must be a number</sch:assert>
</sch:rule>
```

*2. Same Context*

Another approach would be

```
<sch:rule context="fred" viewAs="my:tokenize( '(a*b*c*)*' )">
    <sch:assert test="$sch:rule-context/parent::*/name()='Mavis' and contains(., 'c')">
        Fred's parent must be Mavis, and each abc token must contain at least one 'c'.
    </sch:assert>
</sch:rule>
```

In this approach, the original rule context is provided by a new built-in variable $sch:rule-context. Variables and asserts would operate in the context of the token. So this would be the same as the previous:

```
<sch:rule context="fred" viewAs="my:tokenize( '(a*b*c*)*' )">
    <sch:let name="freds-parent" value="$sch:rule-context/parent::*" as="element()" />
    <sch:assert test="$freds-parent/name()='Mavis' and contains(., 'c')">
        Fred's parent must be Mavis, and each abc token must contain at least one 'c'.
    </sch:assert>
</sch:rule>
```

When defining this, I suppose that the QLBs should state
- xslt1 and xpath1: ***@***.*** not available (as they have no regex capabilities)
  - or, provide some simple tokenizing mechanism such as just tokenising on whitespace+"|"
- xslt2 and xpath2: regex with no capture group: tokenize()
- xslt3 and xpath3: tokenize() and regex with capture group analyze-string()

But I don't have a strong opinion either way. And I still suspect that the full sch:viewAs element offers more flexibility (e.g. with directory listings, custom functions, or having to interact with e.g. Java extension functions.)

*3. Allow ***@***.*** to specify functions*

I don't like this one much. Another approach might be to allow ***@***.*** to specify compound paths with functions rather than just being simple location steps.

````
<sch:rule context="text()/analyze-string(., '[^\s]+')/fn:match">
    <sch:report test=".='foo'">Oh dear I found a foo</sch:report>
</sch:rule>
````

The XPath 3 analyze-string function returns an element. In this case, the compiler sees that there is an "/analyze-string(." text, and splits it to do the iteration. This has the advantage of not requiring extra Schematron elements or concepts. But it might stuff things up if anyone wanted to use analyze-string inside a predicate in the @context location step (which might not be a common thing). So it seems too messy.

*4. ***@***.****

I don't like this one much. Another approach might be to use the existing ***@***.***

````
<sch:schema ...>
    ...
    <sch:let name="analyzed-text" value="analyze-string(string-join(//text(), ' '), '[^\s]+')" as="element()" />
    <sch:pattern documents="$analyzed-text">
        <sch:rule context="fn:match">
            <sch:report test=".='foo'">Found a foo!</sch:report>
        </sch:rule>
    </sch:pattern>
````

So this does not require any change to Schematron 2016 (IIRC) syntax. analyze-string() returns an element, not a document, so the compiler or engine would have to convert the element to a document. The downside is that this only allows a very coarse-grained approach. And the item cannot be reported in terms of its original context, which I think is not satisfactory.

*5. Extend ***@***.*** to allow variable*

In this approach we say that rulesets can define their own data source for the rule. So

````
<sch:ruleset name="text-no-noes">
    <sch:let name="tokenized-text" value="for $t in //*[text()] return analyze-string($t, '[^\s]+')" as="element()*"/>
    <sch:rule context="$tokenized-text/fn:match">
        <sch:report test=".='foo' ">Found a foo </sch:report>
        ...
````

This works because there is no need to restrict ***@***.*** to a simple location path (i.e. what can appear in ***@***.***). So ***@***.*** can continue the way it is: validating the provided document. But rulesets would have the added capability of being able to operate on views. (I guess the implementation would just check if the first character of the @context was "$" if it needed to generate different code for a view.)

Rick
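
For readers who want to try the CSV example from approach 1 without a Schematron engine, here is a small Python sketch of the same idea: parse the embedded text into rows and cells, then visit each row and apply the "second cell must be a number" check. The function names are hypothetical stand-ins for my:parseCsv() and the rule, and "|" is assumed as the cell separator, as in the sample data.

```python
import math


def parse_csv(text, cell_sep="|"):
    """Parse newline-separated rows of cell_sep-separated cells into lists."""
    return [line.split(cell_sep) for line in text.splitlines() if line.strip()]


def check_rows(rows):
    """Visit each row; report rows whose second cell is not a non-zero number."""
    failures = []
    for n, cells in enumerate(rows, start=1):
        try:
            value = float(cells[1]) if len(cells) >= 2 else math.nan
        except ValueError:
            value = math.nan  # not numeric, like number() returning NaN
        if math.isnan(value) or value == 0:
            failures.append((n, "The second cell must be a number"))
    return failures


rows = parse_csv("abc|123|456\ndef|123|456")
print(check_rows(rows))                                    # no failures expected
print(check_rows(parse_csv("abc|x|1\ndef|0|2\nghi|7|3")))  # rows 1 and 2 fail
```

The row number plays the role of the @n attribute in the parsed view, so each failure can still be traced back to a position in the original data.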

dmj · 2024-10-01T07:21:32Z

If the minimal approach is desired, I can see five approaches. I like 1, and 2 is OK.

Same here.

dmj · 2024-10-01T07:39:50Z
