Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exponentially-slow regex #15

Open
giovannipizzi opened this issue Oct 3, 2018 · 2 comments
Open

Exponentially-slow regex #15

giovannipizzi opened this issue Oct 3, 2018 · 2 comments
Assignees

Comments

@giovannipizzi
Copy link
Member

As discussed in e.g. these links [1,2] there are some regex that, with the default python (or perl, ...) implementation, can become exponentially slow with some inputs.

To reproduce, one can e.g. run the following, where cell_parameters_block_re is the one defined in qeinputparser.py:

cell_string = """CELL_PARAMETERS angstrom
1.000000000 0.00000000 0.000000000
0.000000000 1.00000000 0.000000000
0.000000000 0.00000000 1.000000000"""
print cell_parameters_block_re.match(cell_string) # does not end after >60seconds
print cell_parameters_block_re.match(cell_string + "\n") # ends instantaneously

Already using a Thompson NFA implementation improves the situation: e.g. I tried this from this BitBucket repo with pip install regex and did a import regex as re as a drop-in replacement for re when generating cell_parameters_block_re can alleviate the problem, bringing the time down to ~3sec.
This can be done, adding a dependency, still it's not perfect and probably our regex can be improved.

At the moment, I suggest to always append a newline at the end of the input for safety, and if we see that this is not enough, we open this issue again and replace re with regex (and/or improve the regex).

[1] https://stackoverflow.com/questions/25982466/why-does-this-take-so-long-to-match-is-it-a-bug
[2] https://swtch.com/~rsc/regexp/regexp1.html

Mentioning @lekah

@giovannipizzi giovannipizzi self-assigned this Oct 3, 2018
@giovannipizzi
Copy link
Member Author

To test (fast version, with newline):

# No newline in this string
cell_string = """CELL_PARAMETERS angstrom
1.000000000 0.00000000 0.000000000
0.000000000 1.00000000 0.000000000
0.000000000 0.00000000 1.000000000"""

from qe_tools.parsers.qeinputparser import parse_cell_parameters
print(parse_cell_parameters(cell_string + '\n'))

giovannipizzi added a commit that referenced this issue Oct 3, 2018
This is probably not a real fix but only
an alleviation for #15.
@giovannipizzi
Copy link
Member Author

At the moment, we added a 'workaround' that solves the problem in a known case.
Please report here any further occurrences of the problem - if we have no news in the next few months we can close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant