Skip to content
feuerbach edited this page Oct 27, 2011 · 4 revisions

Update

Now this issue is resolved by adopting approach similar to that of re2 library. TODO: describe the details.

Motivation

At some point I got really annoyed by the absence of a good regular expression library for Haskell. I mean, we have quite a few of them, but at best they can provide the submatch information in the Perl style, i.e. as a list of strings that were matched.

Instead, it would be very Haskelly to have parsing combinators that would provide a nice high-level interface while using efficient regex implementation internally.

Implementation

Then I saw the paper "A Play on Regular Expressions", which shows how one can accumulate the information about submatches using a semiring, and it occurred to me that applicative functors make good semirings, if you ignore the types and don't mind possibly breaking a few laws. Naturally, <*> is multiplication and <|> is addition.

Semantics

There are two semantics for regexes: POSIX and Perl. POSIX regex engine is easier to implement efficiently, but it is of less use for parsing. Perl regexes, on the contrary, allow more control and more features (like non-greedy patterns). So, if the library is supposed to be practical, it needs the Perl semantics.

The Problem

Of course, there is a price to pay for the power of Perl-like regexes. (The rest of this story assumes that you've read the "Play" paper and understand how the Glushkov automaton works. And it still may be not clear enough until you read the code...)

The Perl semantics means, in particular, that we need to prefer the left part of an alternative over the right part. And the earlier alternatives take precedence over the later ones.

So, we need to keep the history of each "successfull" state as we transfer it around the tree. That's what Priority datatype is for. Internally it is a list of numbers (currently, just zeros and ones) that show which part we chose at each alternative. When we need to choose between two states, we compare their priorities using the lexicographical order.

Two problems with this approach are that it's not efficient (both "comparing" and "appending to the list" steps), and it requires the amount of memory linear in the string size.

There are some tricks to improve the constant factor (pack 0s and 1s as bits in an Int). Also there are tricks to help the memory consumption (collect and optimize the priority lists once in a while).

I feel that there should be a nice constant-space efficient algorithm to keep track of priorities, but failed to find it.

So...

I know there are a lot of bright minds in the Haskell community. It would be great if we manage to make this work.

You can find the code at https://github.com/feuerbach/regex-applicative

It's best to keep the discussion on the haskell-cafe mailing list.

I'm looking forward to your ideas and patches.

Responses

Discussion emerged in two places: on the Haskell reddit, and on the haskell-cafe mailing list.

Clone this wiki locally