Describe the bug
Control characters such as `\x1f` break German sentence segmentation at the `format_numbered_list_with_periods` step: `segment()` raises a `ValueError` instead of returning sentences.
To Reproduce
Steps to reproduce the behavior:

Input text: `'1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'`

Code:

    import pysbd

    example_text = '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'
    segmenter = pysbd.Segmenter(language="de", clean=False, char_span=True)
    sents_char_spans = segmenter.segment(example_text)
Expected behavior
Expected output: `['1.\x1f\x1fApfel\x1d', '2.\x1f\x1fBanana']`
Additional context
pysbd version: '0.3.4'
Python 3.8.10
Tried on both Windows and Linux
Traceback (most recent call last):

    in <module>
        2  sents_char_spans = segmenter.segment(example_text)

    C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\segmenter.py:87 in segment
        87  postprocessed_sents = self.processor(text).process()

    C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\processor.py:33 in process
        33  self.text = li.add_line_break()

    C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists_item_replacer.py:61 in add_line_break
        61  self.format_numbered_list_with_periods()

    C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists_item_replacer.py:80 in format_numbered_list_with_periods
        80  self.replace_periods_in_numbered_list()

    C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists_item_replacer.py:76 in replace_periods_in_numbered_list
        76  self.scan_lists(self.NUMBERED_LIST_REGEX_1, self.NUMBERED_LIST_REGEX_2,
        77                  '♨', strip=True)

    C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists_item_replacer.py:114 in scan_lists
        113  list_array = re.findall(regex1, self.text)
        114  list_array = list(map(int, list_array))

    ValueError: invalid literal for int() with base 10: '\x1d2'
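The error message suggests that the list regex captured the `\x1d` separator together with the digit, and that `int()` then rejected the string `'\x1d2'`. Below is a minimal sketch of two possible mitigations, not taken from pysbd. Both `matches_to_ints` and `mask_separators` are hypothetical helper names introduced only for illustration. The first shows how the conversion step could tolerate stray characters. The second is a caller-side workaround that replaces the separators `\x1c` to `\x1f` with spaces before segmenting. Whether pysbd then splits the masked text at the desired point has not been verified here.

```python
import re


def matches_to_ints(matches):
    """Hypothetical helper (not part of pysbd): convert regex matches to ints.

    Non-digit characters such as a leading '\x1d' are dropped first,
    so they never reach int(). Matches without any digit are skipped.
    """
    numbers = []
    for match in matches:
        digits = re.sub(r"\D", "", match)
        if digits:
            numbers.append(int(digits))
    return numbers


# ASCII file/group/record/unit separators (\x1c to \x1f).
_SEPARATORS = re.compile(r"[\x1c-\x1f]")


def mask_separators(text):
    """Hypothetical caller-side workaround: replace each separator with a space.

    The replacement is one character for one character, so the masked text
    has the same length and any character offsets computed on it can be
    used to slice the original, unmasked text.
    """
    return _SEPARATORS.sub(" ", text)


print(matches_to_ints(["1", "\x1d2"]))  # [1, 2]

original = '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'
masked = mask_separators(original)
print(repr(masked))
print(len(masked) == len(original))
```

Because masking preserves length, spans returned for the masked text (with `char_span=True`) could be mapped back by slicing `original` with the same start and end offsets, which would keep the control characters in the final sentences.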