Bug in German splitting with parenthesis #120

Open
kongyurui opened this issue Apr 1, 2023 · 0 comments

Describe the bug
In German text, an opening parenthesis in certain positions (here, the "B(asic)" in the sample below) makes sentence splitting crash with re.error instead of returning segments.

To Reproduce

from pysbd import Segmenter

text = 'auf der Suche nach Einsätzen als Skilehrer im DACH-Raum. Langjährige Erfahrung im Leiten von Gruppen diverser Altersgruppen und Sportarten. B.A. Sport und Gesundheit in Prävention und Therapie (Deutsche Spothochschule Köln) Zertifikate: Erste Hilfe, DRK Rettungsschwimmer silber, DSHS Fitnesstrainer B(asic) Lizenz, Aquafitness Instructor, Progressive Muskelentspannung'

de_split = Segmenter(language='de')

de_split.segment(text)

This crashes while compiling the regular expression at:

File "/home/erik/.local/lib/python3.8/site-packages/pysbd/lang/deutsch.py", line 74, in scan_for_replacements
txt = re.sub(r'(?<={am})\.(?=\s)'.format(am=am), '∯', txt)

Expected behavior
The text is segmented into sentences without raising an exception.

Additional context
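The crash is caused by the sequence B(a: its opening parenthesis is interpolated unescaped into the regular expression, which leaves the pattern with an unbalanced group.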
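The failure can be reproduced without pysbd. The sketch below builds the same pattern as line 74 of deutsch.py, assuming that the value of am at that point is the sequence B(a named above (the actual value inside pysbd may differ):

```python
import re

# Assumption: the matched abbreviation text reaching scan_for_replacements
# is the sequence "B(a" reported in this issue.
am = "B(a"

# Same format string as in the traceback (deutsch.py, line 74).
pattern = r'(?<={am})\.(?=\s)'.format(am=am)

try:
    re.compile(pattern)
    error_message = None
except re.error as exc:
    # The "(" from am opens a group that swallows the ")" meant to close
    # the lookbehind, so the lookbehind at position 0 is left unterminated.
    error_message = str(exc)

print(pattern)
print(error_message)
```

The printed pattern shows that the closing parenthesis intended for the lookbehind is consumed by the group that the stray "(" opens.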

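Suggested fix: escape the matched abbreviation before it is interpolated into the pattern, by adding

        am = re.escape(am)

to scan_for_replacements in deutsch.py, before the re.sub call.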
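A minimal sketch of the effect of the suggested fix, outside pysbd. The helper name replace_abbreviation_period and the sample string are hypothetical and exist only for illustration; the re.sub call is copied from the traceback:

```python
import re

def replace_abbreviation_period(txt, am):
    # Hypothetical stand-in for the failing line in scan_for_replacements,
    # with the proposed re.escape call added in front of it.
    am = re.escape(am)  # "B(a" becomes "B\(a", a literal match
    return re.sub(r'(?<={am})\.(?=\s)'.format(am=am), '∯', txt)

# Illustrative input: a period directly after "B(a", followed by whitespace.
result = replace_abbreviation_period("Lizenz B(a. Weiter", "B(a")
print(result)
```

Because the escaped text is still a fixed-width literal, it remains valid inside the lookbehind, and abbreviations without special characters are matched exactly as before.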

Traceback (most recent call last):
  File "german_fix.py", line 8, in <module>
    de_split.segment(text)
  File "/home/erik/.local/lib/python3.8/site-packages/pysbd/segmenter.py", line 87, in segment
    postprocessed_sents = self.processor(text).process()
  File "/home/erik/.local/lib/python3.8/site-packages/pysbd/processor.py", line 34, in process
    self.replace_abbreviations()
  File "/home/erik/.local/lib/python3.8/site-packages/pysbd/processor.py", line 180, in replace_abbreviations
    self.text = self.abbreviations_replacer().replace()
  File "/home/erik/.local/lib/python3.8/site-packages/pysbd/lang/deutsch.py", line 66, in replace
    self.text = self.search_for_abbreviations_in_string(self.text)
  File "/home/erik/.local/lib/python3.8/site-packages/pysbd/abbreviation_replacer.py", line 92, in search_for_abbreviations_in_string
    text = self.scan_for_replacements(
  File "/home/erik/.local/lib/python3.8/site-packages/pysbd/lang/deutsch.py", line 74, in scan_for_replacements
    txt = re.sub(r'(?<={am})\.(?=\s)'.format(am=am), '∯', txt)
  File "/usr/lib/python3.8/re.py", line 210, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 759, in _parse
    raise source.error("missing ), unterminated subpattern",
re.error: missing ), unterminated subpattern at position 0