Exponentially-slow regex #15

giovannipizzi · 2018-10-03T17:06:04Z

As discussed in e.g. these links [1,2] there are some regex that, with the default python (or perl, ...) implementation, can become exponentially slow with some inputs.

To reproduce, one can e.g. run the following, where cell_parameters_block_re is the one defined in qeinputparser.py:

cell_string = """CELL_PARAMETERS angstrom
1.000000000 0.00000000 0.000000000
0.000000000 1.00000000 0.000000000
0.000000000 0.00000000 1.000000000"""
print cell_parameters_block_re.match(cell_string) # does not end after >60seconds
print cell_parameters_block_re.match(cell_string + "\n") # ends instantaneously

Already using a Thompson NFA implementation improves the situation: e.g. I tried this from this BitBucket repo with pip install regex and did a import regex as re as a drop-in replacement for re when generating cell_parameters_block_re can alleviate the problem, bringing the time down to ~3sec.
This can be done, adding a dependency, still it's not perfect and probably our regex can be improved.

At the moment, I suggest to always append a newline at the end of the input for safety, and if we see that this is not enough, we open this issue again and replace re with regex (and/or improve the regex).

[1] https://stackoverflow.com/questions/25982466/why-does-this-take-so-long-to-match-is-it-a-bug
[2] https://swtch.com/~rsc/regexp/regexp1.html

Mentioning @lekah

The text was updated successfully, but these errors were encountered:

giovannipizzi · 2018-10-03T17:10:45Z

To test (fast version, with newline):

# No newline in this string
cell_string = """CELL_PARAMETERS angstrom
1.000000000 0.00000000 0.000000000
0.000000000 1.00000000 0.000000000
0.000000000 0.00000000 1.000000000"""

from qe_tools.parsers.qeinputparser import parse_cell_parameters
print(parse_cell_parameters(cell_string + '\n'))

This is probably not a real fix but only an alleviation for #15.

giovannipizzi · 2018-10-04T10:50:34Z

At the moment, we added a 'workaround' that solves the problem in a known case.
Please report here any further occurrences of the problem - if we have no news in the next few months we can close this issue.

giovannipizzi self-assigned this Oct 3, 2018

giovannipizzi added a commit that referenced this issue Oct 3, 2018

Adding a newline before parsing.

ccdddc8

This is probably not a real fix but only an alleviation for #15.

This was referenced Oct 3, 2018

Fix exponential regex time #16

Merged

Infinite parsing time when parsing some specific files giovannipizzi/seekpath#50

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exponentially-slow regex #15

Exponentially-slow regex #15

giovannipizzi commented Oct 3, 2018

giovannipizzi commented Oct 3, 2018

giovannipizzi commented Oct 4, 2018

Exponentially-slow regex #15

Exponentially-slow regex #15

Comments

giovannipizzi commented Oct 3, 2018

giovannipizzi commented Oct 3, 2018

giovannipizzi commented Oct 4, 2018