Control characters break German segmentation #121

Open
edloginova opened this issue Jun 20, 2023 · 0 comments


Describe the bug
Control characters such as \x1f and \x1d break German sentence segmentation in the format_numbered_list_with_periods step: segment() raises a ValueError instead of returning sentences.

To Reproduce
Steps to reproduce the behavior:
Input text - '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'

Code:

import pysbd
example_text = '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'
segmenter = pysbd.Segmenter(language="de", clean=False, char_span=True)
sents_char_spans = segmenter.segment(example_text)      

Expected behavior
Expected output:
['1.\x1f\x1fApfel\x1d', '2.\x1f\x1fBanana']
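Until this is fixed, one possible workaround is to mask the separator control characters before segmentation. The sketch below is an assumption-laden illustration, not part of pysbd: `mask_separators` is a hypothetical helper that replaces each of \x1c-\x1f with a single space, so the text length (and therefore every character offset) stays the same. It has not been verified that pysbd then produces exactly the expected output above.

```python
import re

# Information separators FS, GS, RS, US (\x1c-\x1f).
# Python's \s treats them as whitespace for str patterns.
_SEPARATORS = re.compile(r"[\x1c-\x1f]")


def mask_separators(text: str) -> str:
    """Hypothetical helper: replace each separator with one space (length-preserving)."""
    return _SEPARATORS.sub(" ", text)


original = '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'
masked = mask_separators(original)

# Same length, so offsets computed on `masked` also index into `original`.
same_length = len(masked) == len(original)
```

With char_span=True, the masked text could be passed to segmenter.segment and the returned start/end offsets used to slice the original string, which keeps the control characters in the final sentences.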

Additional context
pysbd version: '0.3.4'
Python version: 3.8.10
OS: reproduced on both Windows and Linux

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>                                                                                      │
│                                                                                                  │
│   1 segmenter = pysbd.Segmenter(language="de", clean=False, char_span=True)                      │
│ ❱ 2 sents_char_spans = segmenter.segment(example_text)                                           │
│   3                                                                                              │
│                                                                                                  │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\segme │
│ nter.py:87 in segment                                                                            │
│                                                                                                  │
│   84 │   │   if self.clean or self.doc_type == 'pdf':                                            │
│   85 │   │   │   text = self.cleaner(text).clean()                                               │
│   86 │   │                                                                                       │
│ ❱ 87 │   │   postprocessed_sents = self.processor(text).process()                                │
│   88 │   │   sentence_w_char_spans = self.sentences_with_char_spans(postprocessed_sents)         │
│   89 │   │   if self.char_span:                                                                  │
│   90 │   │   │   return sentence_w_char_spans                                                    │
│                                                                                                  │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\proce │
│ ssor.py:33 in process                                                                            │
│                                                                                                  │
│    30 │   │   │   return self.text                                                               │
│    31 │   │   self.text = self.text.replace('\n', '\r')                                          │
│    32 │   │   li = ListItemReplacer(self.text)                                                   │
│ ❱  33 │   │   self.text = li.add_line_break()                                                    │
│    34 │   │   self.replace_abbreviations()                                                       │
│    35 │   │   self.replace_numbers()                                                             │
│    36 │   │   self.replace_continuous_punctuation()                                              │
│                                                                                                  │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists │
│ _item_replacer.py:61 in add_line_break                                                           │
│                                                                                                  │
│    58 │   def add_line_break(self):                                                              │
│    59 │   │   self.format_alphabetical_lists()                                                   │
│    60 │   │   self.format_roman_numeral_lists()                                                  │
│ ❱  61 │   │   self.format_numbered_list_with_periods()                                           │
│    62 │   │   self.format_numbered_list_with_parens()                                            │
│    63 │   │   return self.text                                                                   │
│    64                                                                                            │
│                                                                                                  │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists │
│ _item_replacer.py:80 in format_numbered_list_with_periods                                        │
│                                                                                                  │
│    77 │   │   │   │   │   │   '♨', strip=True)                                                   │
│    78 │                                                                                          │
│    79 │   def format_numbered_list_with_periods(self):                                           │
│ ❱  80 │   │   self.replace_periods_in_numbered_list()                                            │
│    81 │   │   self.add_line_breaks_for_numbered_list_with_periods()                              │
│    82 │   │   self.text = Text(self.text).apply(self.SubstituteListPeriodRule)                   │
│    83                                                                                            │
│                                                                                                  │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists │
│ _item_replacer.py:76 in replace_periods_in_numbered_list                                         │
│                                                                                                  │
│    73 │   │   self.text = Text(self.text).apply(self.ListMarkerRule)                             │
│    74 │                                                                                          │
│    75 │   def replace_periods_in_numbered_list(self):                                            │
│ ❱  76 │   │   self.scan_lists(self.NUMBERED_LIST_REGEX_1, self.NUMBERED_LIST_REGEX_2,            │
│    77 │   │   │   │   │   │   '♨', strip=True)                                                   │
│    78 │                                                                                          │
│    79 │   def format_numbered_list_with_periods(self):                                           │
│                                                                                                  │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists │
│ _item_replacer.py:114 in scan_lists                                                              │
│                                                                                                  │
│   111 │                                                                                          │
│   112 │   def scan_lists(self, regex1, regex2, replacement, strip=False):                        │
│   113 │   │   list_array = re.findall(regex1, self.text)                                         │
│ ❱ 114 │   │   list_array = list(map(int, list_array))                                            │
│   115 │   │   for ind, item in enumerate(list_array):                                            │
│   116 │   │   │   # to avoid IndexError                                                          │
│   117 │   │   │   # ruby returns nil if index is out of range                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: invalid literal for int() with base 10: '\x1d2'
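
The error can be reproduced without pysbd. The sketch below uses a simplified stand-in for the numbered-list pattern (an assumption for illustration, not pysbd's actual NUMBERED_LIST_REGEX_1) to show the likely mechanism: for str patterns, `\s` matches the separators \x1c-\x1f, so '\x1d' is captured together with the digit, while `int()` does not skip it as leading whitespace.

```python
import re

# Simplified stand-in for a "list number followed by a period" pattern.
# Assumption for illustration only; this is not pysbd's NUMBERED_LIST_REGEX_1.
LIST_NUMBER = re.compile(r"\s\d{1,2}(?=\.\s)|^\d{1,2}(?=\.\s)")

example_text = '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'

# \s matches '\x1d', so the second match includes the control character.
matches = LIST_NUMBER.findall(example_text)


def int_fails(value):
    """Return True if int(value) raises ValueError."""
    try:
        int(value)
    except ValueError:
        return True
    return False


# int() rejects the leading '\x1d', but str.strip() removes it first.
failing = [m for m in matches if int_fails(m)]
numbers = [int(m.strip()) for m in matches]
```

Stripping each match before conversion is one possible direction for a fix in scan_lists; whether that alone yields the expected segmentation has not been tested here.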