Sentences within a dataset #6

gabrer · 2017-05-17T17:32:42Z

I am working on a rather "noisy" dataset, so it is very difficult to detect sentence boundaries exactly (for example, I have many abbreviations containing periods, and these periods are detected as ends of sentences).

Do you think that having many short sentences (often just 3 words) could compromise the algorithm's performance? Is it important to preserve the information about which words belong to the same sentence?

PS: Furthermore, if punctuation is filtered out, the information about a "phrase" is completely lost, as each document becomes a bag of words. Could it also work in this case?

askerlee · 2017-05-18T06:15:12Z

Sentence information is not actually used, so it should not affect performance. Do you mean that the dots are part of the abbreviations? If so, you could modify the regular expression used to extract tokens from the text.
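
As a rough illustration of that suggestion, a token pattern can try a dotted-abbreviation alternative before falling back to plain word tokens. This is only a sketch: `TOKEN_RE` and `tokenize` are hypothetical names, and the pattern is not the regular expression actually used in this repository, which is not shown in this thread.

```python
import re

# Hypothetical pattern, NOT the one used in the repository.
# First alternative: dotted abbreviations made of single letters,
# such as "e.g." or "U.S.A." (two or more letter+dot pairs).
# Second alternative: ordinary word tokens.
TOKEN_RE = re.compile(r"(?:[A-Za-z]\.){2,}|\w+")

def tokenize(text):
    """Return tokens, keeping dotted abbreviations as single tokens."""
    return TOKEN_RE.findall(text)

tokens = tokenize("The U.S.A. team met at 5 p.m. today, e.g. in Rome.")
print(tokens)
```

Abbreviations with multi-letter parts (such as "approx.") would need extra alternatives or an explicit list, and a pattern like this cannot recover from periods that are plain typing mistakes.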


gabrer · 2017-05-18T11:56:41Z

Oh, thank you for confirming this!
I've already modified the regular expression, but unfortunately the stray periods are not only abbreviations but also "mistakes".

Thank you anyway!
