I am working on a rather noisy dataset, so it is very difficult to detect sentence boundaries exactly (for example, I have many abbreviations containing periods, and those periods are detected as the ends of sentences).
Do you think that having many short sentences (often of just three words) could compromise the algorithm's performance? Is it important to preserve the information about which words belong to which sentence?
PS: Furthermore, if punctuation is filtered out, the information about a "sentence" is completely lost, since documents become a bag of words. Could it work in that case as well?
Sentence information is actually not used, so it should not impact performance. Do you mean that the dots are part of the abbreviations? In that case, you could modify the regular expression used to extract tokens from the text.
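The original thread does not name the library or its default token regex, so as a rough sketch: assuming a common word-level pattern such as `\b\w\w+\b` (the default in tools like scikit-learn's `CountVectorizer`), the pattern could be extended with an alternative that keeps dotted abbreviations like "e.g." or "U.S.A." as single tokens, so their periods never reach the sentence splitter as boundaries. The names below (`DEFAULT_TOKEN_RE`, `tokenize`) are illustrative, not part of any specific library.

```python
import re

# Hypothetical default: tokens are runs of word characters, two or more long
# (similar to scikit-learn's default token_pattern r"(?u)\b\w\w+\b").
DEFAULT_TOKEN_RE = re.compile(r"\b\w\w+\b")

# Modified pattern: first try to match dotted abbreviations such as "e.g."
# or "U.S.A." (two or more letter-period pairs), then fall back to plain words.
ABBREV_AWARE_RE = re.compile(r"(?:[A-Za-z]\.){2,}|\b\w\w+\b")

def tokenize(text, pattern=ABBREV_AWARE_RE):
    """Return the list of tokens matched by `pattern`, left to right."""
    return pattern.findall(text)

text = "The U.S.A. dataset is noisy, e.g. many abbreviations."
print(tokenize(text))
# → ['The', 'U.S.A.', 'dataset', 'is', 'noisy', 'e.g.', 'many', 'abbreviations']
```

Because the abbreviation alternative comes first, it wins whenever both could match at the same position, so the periods inside abbreviations are absorbed into the token rather than being left behind as spurious sentence-final dots.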