Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex for "match everything till eol" matches nothing #231

Open
Hi-Angel opened this issue Oct 15, 2024 · 11 comments · May be fixed by #232
Open

Regex for "match everything till eol" matches nothing #231

Hi-Angel opened this issue Oct 15, 2024 · 11 comments · May be fixed by #232

Comments

@Hi-Angel
Copy link

Describe the bug

Using a pretty simple ".*$" regexp results in No Regex pattern match. That is with noFlags flag that implies multiline: false, so $ should match something.

To Reproduce

Run this code:

module Main where

import Prelude

import Data.Either (Either(..))
import Data.String.Regex.Flags (noFlags)
import Effect (Effect)
import Effect.Console (logShow)
import Parsing (Parser, fail, runParser)
import Parsing.String (regex)

-- The regex from `Parsing` but with more convenient interface.
regexP :: String -> Parser String String
regexP re = case regex re noFlags of
  Left compileError -> fail $ "failed to compile regex: " <> compileError
  Right ret -> ret

s = """My
multiline
string
"""

main :: Effect Unit
main = do
  logShow $ runParser s $ regexP ".*$"

Expected behavior

The parser ought to return string My

Actual behavior

ParseError "No Regex pattern match" (Position { column: 1, index: 0, line: 1 }
@garyb
Copy link
Member

garyb commented Oct 15, 2024

You would need the multiline flag to be enabled for it to match "My" - but it should return "", because it'll be matching on $ but nothing else (. doesn't match newlines without dotAll - if you remove the trailing \n it should return "string").

I'm not really sure where the culprit is here, neither the parsing nor Data.String regex stuff looks wrong from an initial glance.

@garyb
Copy link
Member

garyb commented Oct 15, 2024

Oh actually, immediately after saying that I see it. The parser regex inserts a ^ at the start of the pattern so multiline or dotAll would be required for anything to match. I think it's correct that it only matches from the current parser position, but maybe the consequence of that needs emphasising in the comment for it (it does already mention it matches from the current position).

@natefaubion
Copy link
Contributor

That is with noFlags flag that implies multiline: false, so $ should match something.

It seems like there's maybe a slight misunderstanding wrt JS RegExp semantics. Without the multiline flag $ always matches end of the string, not end of the line. Maybe you are expecting the inverse, where the multiline flag is where $ matches end of string, but it is in fact matching end of line? It's definitely confusing.

@Hi-Angel
Copy link
Author

That is with noFlags flag that implies multiline: false, so $ should match something.

It seems like there's maybe a slight misunderstanding wrt JS RegExp semantics. Without the multiline flag $ always matches end of the string, not end of the line. Maybe you are expecting the inverse, where the multiline flag is where $ matches end of string, but it is in fact matching end of line? It's definitely confusing.

Oh, just as I was typing out, you clarified, thank you! Yes, I am not a web-programmer, and in all other languages "multline regex being true" means the opposite, i.e. when you consider whole string more or less as a single line. Okay, I see now, let me try that.

@Hi-Angel
Copy link
Author

Oh actually, immediately after saying that I see it. The parser regex inserts a ^ at the start of the pattern so multiline or dotAll would be required for anything to match. I think it's correct that it only matches from the current parser position, but maybe the consequence of that needs emphasising in the comment for it (it does already mention it matches from the current position).

Sorry, I'm confused here 😅 So, the regex matches from current position, and for that it opaquely inserts a ^ to the beginning of the expression. This sounds like a bug on its own, because if a user have line1\n|line2 where the | marks the caret position, and then a user asks for ^line2, inserting the ^ will result in ^^line2.

@natefaubion
Copy link
Contributor

It gets wrapped with ^(...). So If you add an extra ^, it's unnecessary, but benign. Matching from the current cursor position is the only thing that makes sense for a parser that needs to consume it's input. I'm not sure what it would mean for a regex combinator to match arbitrary locations. The JS API doesn't tell you where matches happened, so we need an anchor that assumes we can just jump based on the length of the match.

Hi-Angel added a commit to Hi-Angel/purescript-parsing that referenced this issue Oct 15, 2024
Regular expression users typically expect that matching a `$` in a
multiline string would match the end of current line and not the end
of the string past many lines. This is default behavior in pretty much
every regexp engine: `grep`, `perl`, text editors, you name it… So it
is fair to expect such expectation, so warn a user about necessity to
pass `multiline`

Fixes: purescript-contrib#231
@Hi-Angel Hi-Angel linked a pull request Oct 15, 2024 that will close this issue
4 tasks
@Hi-Angel
Copy link
Author

Please see my docs change proposal here

Hi-Angel added a commit to Hi-Angel/purescript-parsing that referenced this issue Oct 15, 2024
Regular expression users typically expect that matching a `$` in a
multiline string would match the end of current line and not the end
of the string past many lines. This is default behavior in pretty much
every regexp engine: `grep`, `perl`, text editors, you name it… So it
is fair to expect such expectation, so warn a user about necessity to
pass `multiline`

Fixes: purescript-contrib#231
@jamesdbrock
Copy link
Member

Thanks for the docs change proposal.

Also note that if you want “The regex from Parsing but with more convenient interface” then you can just use the Data.String.Regex module directly instead of using Parsing.

@jamesdbrock
Copy link
Member

Also, I know this is just your test code, but note that in real code you shouldn't compile the regex during the parse for the reasons discussed in Parsing.String.regex.

-- The regex from `Parsing` but with more convenient interface.
regexP :: String -> Parser String String
regexP re = case regex re noFlags of
  Left compileError -> fail $ "failed to compile regex: " <> compileError
  Right ret -> ret

Hi-Angel added a commit to Hi-Angel/purescript-parsing that referenced this issue Oct 16, 2024
Regular expression users typically expect that matching a `$` in a
multiline string would match the end of current line and not the end
of the string past many lines. This is default behavior in pretty much
every regexp engine: `grep`, `perl`, text editors, you name it… So it
is fair to expect such expectation, so warn a user about necessity to
pass `multiline`

Fixes: purescript-contrib#231
Hi-Angel added a commit to Hi-Angel/purescript-parsing that referenced this issue Oct 16, 2024
Regular expression users typically expect that matching a `$` in a
multiline string would match the end of current line and not the end
of the string past many lines. This is default behavior in pretty much
every regexp engine: `grep`, `perl`, text editors, you name it… So it
is fair to expect such expectation, so warn a user about necessity to
pass `multiline`

Fixes: purescript-contrib#231
@Hi-Angel
Copy link
Author

Hi-Angel commented Oct 16, 2024

Thank you!

Also note that if you want “The regex from Parsing but with more convenient interface” then you can just use the Data.String.Regex module directly instead of using Parsing.

In the Parsing library context, the convenient interface is one that (in particular) returns a Parsing String String. The Data.String.Regex.regex returns Either, so still requires wrapping, and if I have to wrap, there's not much difference which one it's gonna be.

Also, I know this is just your test code, but note that in real code you shouldn't compile the regex during the parse for the reasons discussed in Parsing.String.regex.

Fair warning, because it's a snippet from an actual project. However, I've read the documentation paragraph referred (the discussion about constant vs dynamic generation in particular) but didn't grasp why should't I be compiling it in runtime.

Now that I'm experimenting with spy I see that it might be due to the fact that regexp doesn't get generated just once but on every call, perhaps that's what you were referring to? In that case though that's a problem, but not the one related to the code or the library, but one in the PureScript compiler. E.g. when I have this code:

-- a word at the beginning of line with optional whitespace
topLevelWord :: String -> Parser String String
topLevelWord s = regexP $ "^\\s*" <> s <> "\\b"

   topLevelWord "SomeFoo"

compiler should detect that it's a pure function-chain, so the calls can be replaced with a constant.

But it's part of the optimization process, and from what I understand the PureScript designers for some reason relegated optimizations to a separate project. So that means that if (or rather I'm hoping "when") my project comes to production, I'll have to use this separate project to produce the production code.

@garyb
Copy link
Member

garyb commented Oct 16, 2024

I see that it might be due to the fact that regexp doesn't get generated just once but on every call, perhaps that's what you were referring to?

That is what he meant, yeah.

Hi-Angel added a commit to Hi-Angel/purescript-parsing that referenced this issue Oct 16, 2024
Regular expression users typically expect that matching a `$` in a
multiline string would match the end of current line and not the end
of the string past many lines. This is default behavior in pretty much
every regexp engine: `grep`, `perl`, text editors, you name it… So it
is fair to expect such expectation, so warn a user about necessity to
pass `multiline`

Fixes: purescript-contrib#231
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

4 participants