Sentence tokenization must ignore newline as whitespace in the default mode. #60

Open
sambitdash opened this issue Feb 26, 2021 · 0 comments · May be fixed by #62
Comments

@sambitdash

Paragraphs often contain newlines carried over from the source document (particularly when the text is copied from a PDF). These newlines should be treated as ordinary whitespace, and therefore ignored, when the text is tokenized into sentences.

The following text was copied from a PDF document:

using WordTokenizers

s = """
In this article, we present a language-independent, unsupervised approach to sentence boundary
detection. It is based on the assumption that a large number of ambiguities in the determination
of sentence boundaries can be eliminated once abbreviations have been identified. Instead of
relying on orthographic clues, the proposed system is able to detect abbreviations with high
accuracy using three criteria that only require information about the candidate type itself and
are independent of context: Abbreviations can be defined as a very tight collocation consisting
of a truncated word and a final period, abbreviations are usually short, and abbreviations
sometimes contain internal periods. We also show the potential of collocational evidence for
two other important subtasks of sentence boundary disambiguation, namely, the detection
of initials and ordinal numbers. The proposed system has been tested extensively on eleven
different languages and on different text genres. It achieves good results without any further
amendments or language-specific resources. We evaluate its performance against three different
baselines and compare it to other systems for sentence boundary detection proposed in the
literature."""

split_sentences(s)

The current output breaks sentences at several of the copied newlines:

"In this article, we present a language-independent, unsupervised approach to sentence boundary"
 "detection."
 "It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified."
 "Instead of"
 "relying on orthographic clues, the proposed system is able to detect abbreviations with high"
 "accuracy using three criteria that only require information about the candidate type itself and"
 "are independent of context: Abbreviations can be defined as a very tight collocation consisting of a truncated word and a final period, abbreviations are usually short, and abbreviations"
 "sometimes contain internal periods."
 "We also show the potential of collocational evidence for"
 "two other important subtasks of sentence boundary disambiguation, namely, the detection of initials and ordinal numbers."
 "The proposed system has been tested extensively on eleven"
 "different languages and on different text genres."
 "It achieves good results without any further"
 "amendments or language-specific resources."
 "We evaluate its performance against three different"
 "baselines and compare it to other systems for sentence boundary detection proposed in the"
 "literature."
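
Until the default mode treats newlines as whitespace, one possible workaround is to normalize the text before tokenizing it. The following is a minimal sketch, not part of WordTokenizers.jl: `normalize_whitespace` is a hypothetical helper name, and the sample is a short excerpt of the text above. It assumes that collapsing every run of whitespace into a single space is acceptable, which also removes paragraph breaks.

```julia
# Hypothetical helper (not provided by WordTokenizers.jl): trim the text and
# collapse every run of whitespace, including newlines, into one space.
normalize_whitespace(text::AbstractString) = replace(strip(text), r"\s+" => " ")

# Short excerpt of the PDF text from this issue, with its hard line breaks.
pdf_text = """
The proposed system has been tested extensively on eleven
different languages and on different text genres. It achieves good results without any further
amendments or language-specific resources."""

cleaned = normalize_whitespace(pdf_text)

# The cleaned string can then be passed to the tokenizer:
# using WordTokenizers
# split_sentences(cleaned)
```

This sketch does not rejoin words hyphenated across line breaks, and it discards blank lines that separate paragraphs, so it may need adjusting where paragraph boundaries matter.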
dhruvil410 added a commit to dhruvil410/WordTokenizers.jl that referenced this issue Mar 19, 2021