NOTE: This is not abandoned code; it's just that it is so simple that it really doesn't need much maintenance. Still, any issues will be treated with due diligence.
`dinant` is an attempt, like many others, to make regular expressions more readable and, like many others, it fails miserably... but we try anyway.
You can find many examples in the source file, which includes unit tests to make sure we don't make things worse. Because its implementation is currently very, very simple, it does not make any checks, so you can shoot yourself in the foot. Also, because it doesn't even attempt to, it does not make any optimizations, and the resulting regexps can be more complex to read and less efficient. But the idea is that you would never see them again. For instance:
capture( one_or_more(any_of('a-z')) ) + zero_or_more(then('[') + capture( zero_or_more(any_of('a-z')) ) + then(']'))
becomes `((?:[a-z])+)(?:\[((?:[a-z])*)\])*` and not `([a-z]+)(?:\[([a-z]*)\])*`.
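To see that the extra non-capturing groups only cost readability, here is a small check using nothing but the standard `re` module. The sample strings and the `describe` helper are made up for illustration; only the two regexps come from the text above.

```python
import re

# The regexp generated from the expression above, and the hand-optimized form.
generated = re.compile(r'((?:[a-z])+)(?:\[((?:[a-z])*)\])*')
optimized = re.compile(r'([a-z]+)(?:\[([a-z]*)\])*')

# Illustrative inputs (not taken from dinant's test suite).
samples = ['foo', 'foo[bar]', 'foo[bar][baz]', 'foo[]', '[bar]', '']


def describe(regexp, text):
    """Return (matched_text, groups) for a full match, or None if there is none."""
    m = regexp.fullmatch(text)
    return None if m is None else (m.group(0), m.groups())


for text in samples:
    # Both forms should agree on every sample.
    assert describe(generated, text) == describe(optimized, text)
    print(repr(text), '->', describe(generated, text))
```

Note that a capture group inside a repetition only keeps its last match, so `foo[bar][baz]` reports `baz` for the second group in both versions.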
`dinant` has evolved a bit, trying to give alternatives that might please other points of view, so you can write the above as:
any_of('a-z', times=[1, ], capture=True) + zero_or_more( '[' + any_of('a-z', times=[0, ], capture=True) + ']' )
or even:
any_of('a-z', times=[1, ], capture=True) + ( '[' + any_of('a-z', times=[0, ], capture=True) + ']' )(times=[0, ])
You might say that this expression is more difficult to read than a regular expression, and I half agree with you. You could split your expression into its components:
name = one_or_more(any_of('a-z'))
key = zero_or_more(any_of('a-z'))
subexp = ( capture(name, 'name') + zero_or_more(then('[') + capture(key, 'key') + then(']')) )
That version of capture can be rewritten as:
subexp = ( name(name='name') + zero_or_more(then('[') + key(name='key') + then(']')) )
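Named captures pay off when you read the match back. The sketch below uses plain `re` with a hand-written regexp that is assumed to be equivalent to `subexp`; it follows the `(?P<...>)` style visible in dinant's output elsewhere in this document, but dinant's exact output may differ. The input string is invented.

```python
import re

# Assumed hand-written equivalent of `subexp`; not copied from dinant's output.
subexp_re = re.compile(r'(?P<name>(?:[a-z])+)(?:\[(?P<key>(?:[a-z])*)\])*')

m = subexp_re.fullmatch('config[debug]')
assert m is not None

# Groups are addressed by name instead of by position.
print(m.group('name'))
print(m.group('key'))
print(m.groupdict())
```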
One cool feature: a `Dinant` expression (object) can tell you which part of your expression fails:
# this is a real world example!
In [1]: import dinant as d
In [2]: line = """36569.12ms (cpu 35251.71ms)\n"""
# can you spot the error?
In [3]: render_time_re = ( d.bol + d.capture(d.float, name='wall_time') + 'ms ' +
...: '(cpu' + d.capture(d.float, name='cpu_time') + 'ms)' + d.eol )
In [4]: print(render_time_re.match(line))
None
In [5]: render_time_re.debug(line)
# ok, this is too verbose (I hope next version will be more human readable)
# but it's clear it's the second capture
Out[5]: '^(?P<wall_time>(?:(?:\\-)?(?:(?:\\d)+)?\\.(?:\\d)+|(?:\\-)?(?:\\d)+\\.|(?:\\-)?(?:\\d)+))ms\\ \\(cpu(?P<cpu_time>(?:(?:\\-)?(?:(?:\\d)+)?\\.(?:\\d)+|(?:\\-)?(?:\\d)+\\.|(?:\\-)?(?:\\d)+))'
`debug()`'s result is the first subexpression that does not match; in this case it's the second `d.capture(d.float, ...)`, so the bug is either there or in the previous subexpression. It turns out that `'(cpu'` needs an extra space:
In [6]: render_time_re = ( d.bol + d.capture(d.float, name='wall_time') + 'ms ' +
...: '(cpu ' + d.capture(d.float, name='cpu_time') + 'ms)' + d.eol )
In [7]: print(render_time_re.match(line))
<_sre.SRE_Match object; span=(0, 27), match='36569.12ms (cpu 35251.71ms)'>
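The idea behind this kind of debugging can be sketched with plain `re`: concatenate the parts one at a time and report the first one that makes the prefix stop matching. This is only an illustration of the technique, not dinant's implementation; `first_failing_part` and the simplified `float_re` are hypothetical names and patterns invented for the example.

```python
import re


def first_failing_part(parts, text):
    """Return the index of the first part that stops the prefix from matching.

    `parts` is a list of regexp fragments that are concatenated in order.
    Returns None if the concatenation of all parts matches `text`.
    """
    for i in range(1, len(parts) + 1):
        prefix = ''.join(parts[:i])
        if re.match(prefix, text) is None:
            return i - 1
    return None


# Simplified float pattern (an assumption, much shorter than dinant's).
float_re = r'-?(?:\d*\.\d+|\d+\.?)'


def build_parts(cpu_literal):
    """Build the fragments of the render-time regexp around one literal."""
    return [
        r'^',
        r'(?P<wall_time>' + float_re + r')',
        re.escape('ms '),
        re.escape(cpu_literal),
        r'(?P<cpu_time>' + float_re + r')',
        re.escape('ms)'),
        r'$',
    ]


line = '36569.12ms (cpu 35251.71ms)\n'

buggy = build_parts('(cpu')    # missing trailing space
fixed = build_parts('(cpu ')

print(first_failing_part(buggy, line))  # index of the second capture
print(first_failing_part(fixed, line))  # None: everything matches
```

As in the session above, the reported part is the second capture, while the actual mistake sits in the literal just before it.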
If the module is run as a script, it will accept such an expression and print the generated regexp to `stdout`:
$ python3 -m dinant "bol + 'run' + any_of('-_ ') + 'test' + maybe('s') + eol"
^run[-_ ]test(?:s)?$
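A quick way to sanity-check the printed regexp is to feed it to the standard `re` module. The sample strings below are invented for illustration.

```python
import re

# The regexp printed by the command above.
generated = re.compile(r'^run[-_ ]test(?:s)?$')

# Illustrative inputs paired with whether they match.
results = {
    text: bool(generated.match(text))
    for text in ['run-test', 'run_tests', 'run test', 'runtest', 'run-testing']
}

for text, matched in results.items():
    print(repr(text), matched)
```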
What about the name? It's a nice town in België/Belgique/Belgien that I plan to visit some time. It could also mean 'dining person' in French[1], which makes sense, as I wrote this during dinner.
`dinant` builds regular expressions (regexps) by concatenating and composing its parts. Here's a list of available elements, following Python's `re` page. Here, `re` is a `dinant` regexp; `m` and `n` are integers; and `s` and `name` are strings.
- `anything` is `.`.
- `bol` is `^` (beginning of line).
- `eol` is `$` (end of line).
- `zero_or_more(re)` is `(re)*`, matching `re` zero or more times. It can also be written as `re(times=[0, ])`.
- `one_or_more(re)` is `(re)+`; also `re(times=[1, ])`.
- `maybe(re)` is `(re)?`; also `re(times=[..., 1])` or `re(times=[0, 1])`.
- Non-greedy versions are generated by adding `greedy=False` to the parameters of the regexp: `zero_or_more(re, greedy=False)`.
- `exactly(m, re)` is `(re){m}`; also `re(times=m)`.
- `between(m, n, re)` is `(re){m,n}`; also `re(times=[m, n])`; with non-greedy version: `between(m, n, re, greedy=False)`.
- `at_most(m, re)` and `at_least(m, re)` are shortcuts for `between(None, m, re)` and `between(m, None, re)`; also `re(times=[..., m])` and `re(times=[m, ...])`. Here `...` is the actual `Ellipsis` literal.
- `text(s)` and `then(s)` match exactly `s`, so it's escaped. You can also concatenate the string: `s + re` or `re + s`. This means you don't have to escape your strings.
- `any_of(s)` is `[s]`, where `s` has to be in an adequate format to go between `[]`s. Check `re`'s doc if unsure.
- `none_of(s)` is `[^s]`.
- `either(re, ...)` is `(re|...)`.
- `capture(re, [name=name])` captures the subexpression, optionally with a name; also `re(capture=True)` or `re(capture=name)` for the named version.
- By default no subexpression is captured unless wrapped in `capture()` or `capture` is passed as a parameter.
- `backref(name)`, `comment(s)`, `lookahead(re)`, `neg_lookahead(re)`, `lookbehind(re)` and `neg_lookbehind(re)` work as expected.
- `regexp(s)` treats `s` as a pure regexp, so no escaping here.
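The `times` and `greedy` parameters in the list above boil down to choosing a quantifier suffix. The following is a minimal sketch of such a mapping, assuming a hypothetical helper called `quantifier`; it is not dinant's code and only mirrors the behaviour described in the list.

```python
import re


def quantifier(times, greedy=True):
    """Translate a `times` value into a regexp quantifier suffix.

    `times` is either an int (exact count) or a list [lo, hi] where a
    missing, None or Ellipsis bound means "unbounded" on that side.
    """
    if isinstance(times, int):
        return '{%d}' % times  # an exact count has no useful lazy variant

    lo = times[0] if len(times) > 0 else None
    hi = times[1] if len(times) > 1 else None
    lo = 0 if lo in (None, Ellipsis) else lo
    hi = None if hi in (None, Ellipsis) else hi

    if (lo, hi) == (0, None):
        suffix = '*'
    elif (lo, hi) == (1, None):
        suffix = '+'
    elif (lo, hi) == (0, 1):
        suffix = '?'
    elif hi is None:
        suffix = '{%d,}' % lo
    else:
        suffix = '{%d,%d}' % (lo, hi)

    # A trailing '?' turns any of the above into its non-greedy version.
    return suffix if greedy else suffix + '?'


def repeat(body, times, greedy=True):
    """Wrap `body` in a non-capturing group and append the quantifier."""
    return '(?:%s)%s' % (body, quantifier(times, greedy))


print(repeat('[a-z]', [0, ]))                # zero_or_more
print(repeat('[a-z]', [..., 1]))             # maybe
print(repeat('[a-z]', [2, 4]))               # between(2, 4, ...)
print(repeat('[a-z]', [0, ], greedy=False))  # non-greedy zero_or_more

# Greedy takes as much as it can, non-greedy as little as it can.
greedy_match = re.match(repeat('[a-z]', [0, ]), 'abc').group(0)
lazy_match = re.match(repeat('[a-z]', [0, ], greedy=False), 'abc').group(0)
print(repr(greedy_match), repr(lazy_match))
```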
Nothing strange so far, just alternative ways to express the same thing. Now the real potential of `dinant` starts to show.
- `digit` matches any (ASCII) digit.
- `digits` too (think of `one_or_more(digits)`).
- `uint` matches any positive (plus 0) integer.
- `int` and `integer` match any integer.
- Guess what `float`, `hex` and `hexa` do (hint: the last two are the same).
- `datetime([format], [buggy_day=False])` matches datetimes using `format`, which is conveniently written in `strptime()` language. `buggy_day` is there because `%d` matches `08` but not `8`.
- `IPv4()` matches IPv4 addresses!
- `IP_port` matches strings in the format `IPv4:port`.
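To get a feel for what `float` accepts, the pattern below is copied from inside the named groups of the `debug()` output shown earlier (with the doubled backslashes of the string repr undone) and tried against a few invented inputs using plain `re`. Whether current versions of dinant still generate exactly this pattern is an assumption.

```python
import re

# Float pattern as it appears in the debug() output earlier in this document.
float_re = re.compile(
    r'(?:(?:\-)?(?:(?:\d)+)?\.(?:\d)+|(?:\-)?(?:\d)+\.|(?:\-)?(?:\d)+)'
)

# Illustrative inputs paired with whether the whole string is a float.
accepted = {
    text: bool(float_re.fullmatch(text))
    for text in ['36569.12', '-0.5', '.5', '3.', '42', 'abc', '1e5']
}

for text, ok in accepted.items():
    print(repr(text), ok)
```

With this pattern, plain integers and forms like `.5` or `3.` are accepted, while exponent notation such as `1e5` is not.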
That's all for now. More will come soon; see `TODO.md` and the issues for a preview.
[1] but the real word is 'dîneur'.