Describe the bug
When an open parenthesis appears in certain situations in German text, it can cause a crash when running sentence splitting.
To Reproduce
from pysbd import Segmenter
text = 'auf der Suche nach Einsätzen als Skilehrer im DACH-Raum. Langjährige Erfahrung im Leiten von Gruppen diverser Altersgruppen und Sportarten. B.A. Sport und Gesundheit in Prävention und Therapie (Deutsche Spothochschule Köln) Zertifikate: Erste Hilfe, DRK Rettungsschwimmer silber, DSHS Fitnesstrainer B(asic) Lizenz, Aquafitness Instructor, Progressive Muskelentspannung'
de_split = Segmenter(language='de')
de_split.segment(text)
This crashes at
File "/home/erik/.local/lib/python3.8/site-packages/pysbd/lang/deutsch.py", line 74, in scan_for_replacements
txt = re.sub(r'(?<={am})\.(?=\s)'.format(am=am), '∯', txt)
Expected behavior
Segments text
Additional context
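Crash due to the sequence B(a: the matched text is interpolated into the regex pattern as am without escaping, so the ( opens a group that is never closed.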
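The failure can be reproduced with the standard `re` module alone. This is a minimal sketch that assumes `am` holds the string `B(a` at the point of the crash (the exact value pysbd passes in may differ, e.g. it may carry leading whitespace); the pattern is the one from line 74 of deutsch.py shown in the traceback.

```python
import re

# Assumed value of the matched abbreviation text; taken from the
# "B(a" sequence named in this report, not read out of pysbd itself.
am = "B(a"

# Same pattern template as in the traceback, with am interpolated unescaped.
pattern = r'(?<={am})\.(?=\s)'.format(am=am)

error_message = None
try:
    re.compile(pattern)
except re.error as exc:
    # The "(" from am closes against the ")" meant for the lookbehind,
    # leaving the lookbehind that starts at position 0 unterminated.
    error_message = str(exc)

print(pattern)
print(error_message)
```

The lookbehind is the subpattern left open, which matches the "at position 0" in the error message.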
Suggested fix: Add
am = re.escape(am)
to scan_for_replacements in deutsch.py, before the re.sub call.
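The suggested fix can be sketched in isolation. The function below is a hypothetical stand-in for the single substitution shown in the traceback, not the actual pysbd method (the real `scan_for_replacements` takes more arguments and may perform further substitutions); the sample strings are made up for illustration.

```python
import re


def replace_abbreviation_period(txt, am):
    """Hypothetical stand-in for one substitution in scan_for_replacements."""
    # Escape regex metacharacters such as "(" so am is matched literally.
    am = re.escape(am)
    return re.sub(r'(?<={am})\.(?=\s)'.format(am=am), '∯', txt)


# With escaping, the "B(a" sequence no longer breaks pattern compilation.
with_paren = replace_abbreviation_period("Trainer B(a. Lizenz", "B(a")

# Ordinary abbreviations are still handled.
plain = replace_abbreviation_period("z.B. etwas", "z.B")

print(with_paren)
print(plain)
```

One side effect to be aware of: after escaping, a `.` inside `am` matches only a literal period rather than any character, which seems to be the intended behavior for abbreviations.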
Traceback (most recent call last):
File "german_fix.py", line 8, in <module>
de_split.segment(text)
File "/home/erik/.local/lib/python3.8/site-packages/pysbd/segmenter.py", line 87, in segment
postprocessed_sents = self.processor(text).process()
File "/home/erik/.local/lib/python3.8/site-packages/pysbd/processor.py", line 34, in process
self.replace_abbreviations()
File "/home/erik/.local/lib/python3.8/site-packages/pysbd/processor.py", line 180, in replace_abbreviations
self.text = self.abbreviations_replacer().replace()
File "/home/erik/.local/lib/python3.8/site-packages/pysbd/lang/deutsch.py", line 66, in replace
self.text = self.search_for_abbreviations_in_string(self.text)
File "/home/erik/.local/lib/python3.8/site-packages/pysbd/abbreviation_replacer.py", line 92, in search_for_abbreviations_in_string
text = self.scan_for_replacements(
File "/home/erik/.local/lib/python3.8/site-packages/pysbd/lang/deutsch.py", line 74, in scan_for_replacements
txt = re.sub(r'(?<={am})\.(?=\s)'.format(am=am), '∯', txt)
File "/usr/lib/python3.8/re.py", line 210, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/usr/lib/python3.8/re.py", line 304, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "/usr/lib/python3.8/sre_parse.py", line 759, in _parse
raise source.error("missing ), unterminated subpattern",
re.error: missing ), unterminated subpattern at position 0