
Checking for substrings of the node values #75

Open
nkutsche opened this issue Jun 12, 2024 · 20 comments
Labels
2025 A change made in preparing the 2025 edition

Comments

@nkutsche

Years ago I developed a Schematron extension for my Escali to improve the checking of sub-phrases inside text nodes. This topic came up again in discussions during XMLPrague 2024, and I would like to follow up on it here:

Currently you can only check nodes in Schematron. If, for instance, you want to forbid the phrase "foo" in any text node, you would do the following:

<sch:rule context="text()">
    <sch:report test="matches(., 'foo')">Don't use 'foo'</sch:report>
</sch:rule>

This has the following problems:

  • If you have multiple occurrences of "foo" in a text node, you always get only one validation error for this node.
  • If you have a long text node, it may be very hard to find the exact phrase that causes the error, especially if the searched phrase is a single character and/or something well hidden.

The basic idea of the extension was to provide a regular expression in addition to the matched context. I would suggest something like this (reworking my initial proposal):

<sch:rule context="text()">
    <sch:report-phrase regex="foo" flags="i">Don't use '<sch:value-of select="."/>'</sch:report-phrase>
</sch:rule>

The following rules should apply:

  • The context node is atomized and analyzed with the regular expression given in @regex
    • Any match produces a potential validation error.
    • The flags in @flags are used for the regex analysis, as in several XPath functions (e.g. matches#3)
  • A sch:report-phrase/@test attribute can optionally be added
    • Its value is an XPath expression; the context is the matched substring (as in xsl:analyze-string)
    • If it returns true, the matched substring produces a validation error
    • The default value is true()
  • The context inside sch:report-phrase is also the matched substring

I don't think it is hard to implement, as the behavior is completely covered by xsl:analyze-string. The biggest challenge is to specify the location of the substring and integrate it into SVRL, isn't it? In my implementation I did this by adding character positions (start and length) to the location node path.
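To make the mapping onto xsl:analyze-string concrete, here is a minimal sketch of the XSLT that the proposed sch:report-phrase example might compile to. It is not taken from Escali; the overall shape of the output and the use of path() for the location are assumptions, and the substring position is deliberately left out.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:svrl="http://purl.oclc.org/dsdl/svrl">

  <xsl:template match="/">
    <svrl:schematron-output>
      <xsl:apply-templates/>
    </svrl:schematron-output>
  </xsl:template>

  <!-- Assumed compiled form of:
       <sch:rule context="text()">
         <sch:report-phrase regex="foo" flags="i"> ... -->
  <xsl:template match="text()">
    <!-- Capture the node path first: inside xsl:matching-substring
         the context item is a string, not a node. -->
    <xsl:variable name="location" select="path(.)"/>
    <xsl:analyze-string select="." regex="foo" flags="i">
      <xsl:matching-substring>
        <!-- The optional @test would go here; its default is true(). -->
        <xsl:if test="true()">
          <svrl:successful-report test="true()" location="{$location}">
            <svrl:text>Don't use '<xsl:value-of select="."/>'</svrl:text>
          </svrl:successful-report>
        </xsl:if>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Each match yields its own svrl:successful-report, so a text node with three occurrences produces three reports. All of them carry the same location, which is exactly the gap described above.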

What do you think?

@rjelliffe
Member

rjelliffe commented Jun 12, 2024

(revised)
At the moment, I think Schematron supports the following (relying on analyze-string() to return an element tree):

<sch:rule context="xxx/text()">
   <!-- Assumes the prefix fn is bound to http://www.w3.org/2005/xpath-functions via sch:ns -->
   <sch:let name="analyzed-string"  value="analyze-string(., 'foo')"
                                    as="element(fn:analyze-string-result)?" />
   <sch:let name="first-matched"    value="$analyzed-string//fn:match[1]" />
   <sch:let name="second-matched"   value="$analyzed-string//fn:match[2]" />
   <sch:let name="third-matched"    value="$analyzed-string//fn:match[3]" />
   <sch:let name="tally-of-matched" value="count($analyzed-string//fn:match)" />

   <sch:report test="$first-matched">
       The naughty text 'foo' was found in the text "<sch:value-of select="
          concat($first-matched/preceding-sibling::*[1], $first-matched, $first-matched/following-sibling::*[1])" />"</sch:report>

   <sch:report test="$second-matched">
       The naughty text 'foo' was found in the text "<sch:value-of select="
          concat($second-matched/preceding-sibling::*[1], $second-matched, $second-matched/following-sibling::*[1])" />"</sch:report>

   <sch:report test="$third-matched">
       The naughty text 'foo' was found in the text "<sch:value-of select="
          concat($third-matched/preceding-sibling::*[1], $third-matched, $third-matched/following-sibling::*[1])" />"</sch:report>

   <sch:report test="$tally-of-matched &gt; 3">Multiple further 'foo' were found.</sch:report>
</sch:rule>

This gives what I think is the desired behaviour: multiple reports, and with better diagnostics. For the particular use-case that nkutsche provides, I suggest the approach above would be all that is needed: it is not hard to enumerate small numbers of rules, where there are small numbers of candidates.

BUT... if it could be shown that there was a use-case for long lists or complex data created by functions, then we do have a gap in Schematron.

(In this view, the problem is that there is no way to iterate assertions/reports over the elements of a tree held in a variable, or over a list of things in a variable.)

One way to approach it would be to adopt something like proposal #63, to allow sch:pattern/@document to reference a document (or sequence of elements) in a variable.

<sch:let name="list-of-analysed-text"
     value="for $t in //xxx/text() return analyze-string($t, 'foo')"
     as="element()*" />
<sch:pattern document="$list-of-analysed-text" >
    <sch:rule context="/fn:analyze-string-result/fn:match">
         <sch:report test="true()">
       The naughty text 'foo' was found in the text "<sch:value-of select="
          concat(preceding-sibling::*[1], ., following-sibling::*[1])" />"</sch:report>
    </sch:rule>
</sch:pattern>

The advantage, perhaps, of this is that it gives the full expressive power of Schematron to the parsed text and it fits in with the notion of a "pattern" better.

Another approach would be to do something on sch:rule, such as an attribute @visit-each:

<sch:pattern>
    <sch:rule context="xxx/text()" visit-each="analyze-string(., 'foo')/fn:match">
         <sch:report test="true()">
       The naughty text 'foo' was found in the text "<sch:value-of select="
          concat(preceding-sibling::*[1], ., following-sibling::*[1])" />"</sch:report>
    </sch:rule>
</sch:pattern>

I think it is good for the standard to try to avoid mixing things related to particular QLBs (i.e. XSLT and the limitations of the XDM) and better to put in generic changes that can apply to all QLBs.

@AndrewSales
Collaborator

Thanks, @nkutsche and @rjelliffe . FWIW, I like:

<sch:rule context="xxx/text()" visit-each="analyze-string(., 'foo')/fn:match">

I have a dim recollection that we also had something similar in the (proprietary) syntax of XMLProbe.

The processing logic would presumably be:

  • @visit-each evaluates to a sequence of any length against the rule/@context;
  • if the result is the empty sequence, the rule still fires;
  • the @tests are then evaluated against every item in the sequence.

It would seem to make sense to pass @visit-each through to the SVRL output too.
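The processing logic above can be sketched as the XSLT a compiler might generate for such a rule. This is only an illustration: the ex:visit-each attribute on svrl:fired-rule and its namespace are hypothetical, since SVRL defines no such attribute today, and the fn prefix is assumed to be bound to the XPath functions namespace.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:svrl="http://purl.oclc.org/dsdl/svrl"
    xmlns:ex="http://example.com/ns/visit-each"
    exclude-result-prefixes="fn">

  <xsl:template match="/">
    <svrl:schematron-output>
      <xsl:apply-templates/>
    </svrl:schematron-output>
  </xsl:template>

  <!-- Text nodes that are not a rule context produce no output. -->
  <xsl:template match="text()"/>

  <!-- Assumed compiled form of a rule with
       context="xxx/text()" and visit-each="analyze-string(., 'foo')/fn:match" -->
  <xsl:template match="xxx/text()">
    <xsl:variable name="location" select="path(.)"/>
    <!-- The rule fires once per context node, even if the sequence is empty. -->
    <svrl:fired-rule context="xxx/text()"
                     ex:visit-each="analyze-string(., 'foo')/fn:match"/>
    <!-- Each item of the sequence becomes the context item for the tests. -->
    <xsl:for-each select="analyze-string(., 'foo')/fn:match">
      <xsl:if test="true()">
        <svrl:successful-report test="true()" location="{$location}">
          <svrl:text>The naughty text 'foo' was found: <xsl:value-of select="."/></svrl:text>
        </svrl:successful-report>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Keeping svrl:fired-rule outside the loop is what makes the rule fire for an empty sequence while the tests run zero times.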

@rjelliffe
Member

Ideally, the @subject should also be an XPath that corresponds to the matched subrange.

The fn:match and fn:non-match elements do not carry indexes into the original string.

But let's suppose @visit-each is constrained to include only things specified in the QLB. So for XSLT 2 and 3 it might be the function analyze-string() only. (Are there any other functions in XPath 2 and 3 that return elements?)

If this were the constraint, then an implementation could certainly count the characters in analyze-string-result/*/text() to find the index into the original text() (or whatever) and put out an XPath with substring() as the @subject.
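A minimal sketch of that character counting, assuming XSLT 3.0 and assuming the substring() expression is written into svrl:successful-report/@location (the thread leaves open where exactly it would go):

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:svrl="http://purl.oclc.org/dsdl/svrl"
    exclude-result-prefixes="fn">

  <xsl:template match="/">
    <svrl:schematron-output>
      <xsl:apply-templates/>
    </svrl:schematron-output>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:variable name="node-path" select="path(.)"/>
    <xsl:for-each select="analyze-string(., 'foo')/fn:match">
      <!-- 1-based start index: length of everything before this match, plus one. -->
      <xsl:variable name="start"
          select="string-length(string-join(preceding-sibling::*, '')) + 1"/>
      <svrl:successful-report test="true()"
          location="substring({$node-path}, {$start}, {string-length(.)})">
        <svrl:text>Found '<xsl:value-of select="."/>'</svrl:text>
      </svrl:successful-report>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

For a text node containing "a foo b foo", the two matches start at positions 3 and 9, so the generated expressions end in `, 3, 3)` and `, 9, 3)`. Summing over the preceding siblings is quadratic in the number of matches, which should only matter for very long text nodes.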

@rjelliffe
Member

rjelliffe commented Jun 13, 2024

Is it that the general case here is what we used to call "embedded notations"? I.e. where there is some language embedded.

For example, CSS, JavaScript and so on. JSON embedded in XML. Or SQL etc. ... or XPath.

For example, imagine a Schematron schema for XSLT that allowed us to apply house-style rules to the XPaths?

  <sch:rule context="xslt:template/@match"  view-as="parse-xpath-to-xml(.)/Im-a-parsed-xpath-top-element" >
      <sch:report test="/Im-a-parsed-xpath-top-element//function[@value='meep-meep']">
               According to our rules, we do not allow the meep-meep() function.
     </sch:report>
  </sch:rule>

This is, of course, one of the ways that syntax checkers like Java FindBugs or PMD work. They create an XML parse tree, then use some XPath system to express rules.

@dmj
Member

dmj commented Jun 13, 2024

> Thanks, @nkutsche and @rjelliffe . FWIW, I like:
>
> <sch:rule context="xxx/text()" visit-each="analyze-string(., 'foo')/fn:match">
>
> I have a dim recollection that we also had something similar in the (proprietary) syntax of XMLProbe.
>
> The processing logic would presumably be:
>
> * `@visit-each` evaluates to a sequence of any length against the `rule/@context`;
> * if the result is the empty sequence, the rule still fires;
> * the `@test`s are then evaluated against every item in the sequence.
>
> It would seem to make sense to pass @visit-each through to the SVRL output too.

+1 for @visit-each.

I would not restrict the value to a particular function, but rather a) make its specification part of the query language binding and b) define the respective XPath version for XSLT & XPath.

Running the assertion over a transformed/parsed context would also make the validation of structured data in attributes ("embedded notations") easier to express.

@rjelliffe
Member

rjelliffe commented Jun 13, 2024 via email

@nkutsche
Author

This discussion is going in a direction I did not expect, and I'm not sure I understand everything.

But my impression is that your focus is on my first issue:

> This has the following problems:
>
>   • If you have multiple occurrences of "foo" in a text node, you always get only one validation error for this node.

At least as important is the second issue:

>   • If you have a long text node, it may be very hard to find the exact phrase that causes the error, especially if the searched phrase is a single character and/or something well hidden.

Let me demonstrate this with an example.

I implemented @rjelliffe's approach for a regex match with pure Schematron. That's how it looks in Oxygen UI:

[Screenshot: result of the pure Schematron approach in Oxygen]

Now compare it with the same regex match just with the Escali extension:

[Screenshot: result of the same regex match with the Escali extension]

How much time do you think you save by not having to search for the exact location of the bad terms, even in these small paragraphs? Now imagine much longer documents with many checks for much better hidden patterns (like abbreviations)!

If I understand @AndrewSales correctly, @visit-each would allow a free XPath expression. But that would not make it possible for a UI to provide the exact position of the matched patterns, would it?

@rjelliffe
Member

rjelliffe commented Jun 14, 2024 via email

@rjelliffe
Member

Nico wrote:

> Now compare it with the same regex match just with the Escali extension

This is not a fair comparison. If you wrote a custom extension for the regex approach you could get specific underlines too.

@rjelliffe
Member

rjelliffe commented Jun 15, 2024 via email

@rjelliffe
Member

I have re-worked my view-as proposal into #76.

This expresses it with more issues addressed, and I didn't want to hijack this thread, so that Nico's proposal can be discussed on its own merits.

@AndrewSales
Collaborator

AndrewSales commented Jun 17, 2024

This thread has developed a lot since I last commented.

@nkutsche , I understand the issue and what you want to achieve.

I favour the additional attribute @visit-each allowing anything that evaluates to a sequence to keep things as generic as possible.

The specific, and likely most common, use case of matching substrings hinges on reporting the locations of those substrings. XPointer offered something here, but I'm not sure there was much uptake.

My feeling is the standard should incorporate the new attribute, make its intended purpose clear, and suggest (non-normatively) a possible method of passing sufficient substring information through to SVRL. For example, where analyze-string() is used, the returned fn:analyze-string-result will sit quite happily in SVRL and provide what an implementation or IDE extension needs to give more information to the user.

It might be clearest to write such rules as:

<rule context='para' visit-each='analyze-string(., "foo")'>
  <report test='fn:match'>[...]

so that an implementation can report the result of visit-each straightforwardly in SVRL.
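For orientation, this is the element tree that analyze-string('Use foo, not FOO.', 'foo') returns when no flags are given. The input string is made up for illustration, and the indentation is added for readability only; the actual result has no whitespace between the child elements. How such a tree would be attached to an SVRL report is not shown, because that is not settled in this thread.

```xml
<fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
  <fn:non-match>Use </fn:non-match>
  <fn:match>foo</fn:match>
  <fn:non-match>, not FOO.</fn:non-match>
</fn:analyze-string-result>
```

Because the matched and unmatched pieces appear in document order and together reproduce the input string, a consumer can recover each match's offset by summing the lengths of its preceding siblings.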

I don't think the standard should attempt to define methods of capturing, expressing, obtaining or reporting substrings beyond this, as in my view that would tend towards a Schematron-specific subsetting of XPath and the XDM. Implementations are of course free to do what they need to internally.

(@nkutsche - just a thought: what if an SQF micro-transform added the match markup to the string? Then your headline report tells you "trouble in this para", when you get there, user has SQF mark up the problems, preferably in an sqf:* element invalid in the host doc.)

@AndrewSales AndrewSales added the 2025 A change made in preparing the 2025 edition label Jul 2, 2024
@dmj
Member

dmj commented Sep 30, 2024

Wouldn't visitEach be a better name? As far as I can see, all other ISO Schematron attributes use camel case.

@dmj
Member

dmj commented Sep 30, 2024

If the @visit-each sets the context for the assertion tests does it also set the context for sch:let variables?

@rjelliffe
Member

rjelliffe commented Oct 1, 2024 via email

@dmj
Member

dmj commented Oct 1, 2024

> If the minimal approach is desired, I can see five approaches. I like 1, and 2 is OK.

Same here.

@dmj
Member

dmj commented Oct 1, 2024

More questions:

  • does @‍visit-each set the context for dynamic expressions in properties and diagnostics?
  • does @‍visit-each set the context for the name query?
  • does @‍visit-each set the context for the value-of query?

@rjelliffe
Member

rjelliffe commented Oct 2, 2024 via email

@AndrewSales
Collaborator

@dmj

> Wouldn't visitEach be a better name? As far as I can see, all other ISO Schematron attributes use camel case.

Well, there is is-a too, but I take your point.

> If the @visit-each sets the context for the assertion tests does it also set the context for sch:let variables?

Good question. This isn't (yet) clarified in the text of the standard.

@rjelliffe the issue with your submissions via email persists whereby markup(?) is substituted by ***@***.***, making your examples unintelligible.

@rjelliffe
Member

rjelliffe commented Oct 27, 2024 via email
