FIX Made the Portuguese tokenizer split on tabs.
andre-martins committed Dec 5, 2014
1 parent 99ed0ee commit 107a849
Showing 2 changed files with 4 additions and 1 deletion.
3 changes: 3 additions & 0 deletions python/tokenizers/portuguese/word_tokenizer.py
@@ -85,6 +85,9 @@ def tokenize(self, text):
 # Note: the Portuguese sentence tokenizer should also do this!!
 text = re.sub('\xc2\xa0', ' ', text)
 
+# Replace tabs by spaces [ATM 3/12/2014].
+text = re.sub('\t', ' ', text)
+
 # Replace U+0096 by dashes.
 text = re.sub('\xc2\x96', ' -- ', text)

2 changes: 1 addition & 1 deletion scripts_srl/evaluator
Submodule evaluator updated from b43ebd to 5a9318
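
For context, a minimal runnable sketch of the whitespace normalization that tokenize() performs after this commit. The normalize_whitespace helper name is hypothetical, and unlike the original Python 2 code, which matches UTF-8 byte pairs such as '\xc2\xa0', this sketch assumes decoded unicode strings:

# -*- coding: utf-8 -*-
import re

def normalize_whitespace(text):
    # Hypothetical helper mirroring the normalization steps in
    # tokenize(). On decoded strings, U+00A0 and U+0096 are matched
    # directly rather than as the byte pairs '\xc2\xa0' / '\xc2\x96'
    # used in the original Python 2 code.
    text = re.sub(u'\xa0', ' ', text)     # non-breaking space -> space
    text = re.sub('\t', ' ', text)        # tab -> space (this commit)
    text = re.sub(u'\x96', ' -- ', text)  # U+0096 -> ' -- '
    return text

print(normalize_whitespace(u'um\tdois\xa0tr\xeas'))
# -> 'um dois três'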
