Sentences within a dataset #6

Open
gabrer opened this issue May 17, 2017 · 2 comments

Comments

gabrer commented May 17, 2017

I am working on a quite noisy dataset, so it is very difficult to detect sentence boundaries exactly (for example, it contains many abbreviations with periods, and those periods are detected as sentence ends).

Do you think that having many short sentences (often just three words) could hurt the algorithm's performance? Is it important to preserve the information about which words belong to the same sentence?

PS: Furthermore, if the punctuation is filtered out, the information about sentences is lost completely and each document becomes a bag of words. Could the algorithm still work in this case?
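
To make the abbreviation problem concrete, here is a minimal sketch of a splitter that skips periods ending a known abbreviation. The function name `split_sentences` and the `ABBREVIATIONS` list are hypothetical and are not taken from this project's code; a real noisy corpus would need its own list, and this approach would likely not catch periods that are plain typing mistakes.

```python
import re

# Hypothetical abbreviation list (lowercase, without the trailing period).
ABBREVIATIONS = {"dr", "mr", "mrs", "e.g", "i.e", "etc", "vs"}

def split_sentences(text):
    """Split on '.', '!' or '?' followed by whitespace,
    unless the period ends a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Token immediately before the punctuation mark.
        preceding = text[start:match.start()].split()
        last_token = preceding[-1].lower() if preceding else ""
        if text[match.start()] == "." and last_token in ABBREVIATIONS:
            continue  # abbreviation period, not a sentence boundary
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

example = "Dr. Smith met Mr. Jones. They talked, e.g. about data. Done!"
print(split_sentences(example))
```

With the sample text above, the periods after "Dr", "Mr" and "e.g" are not treated as boundaries, so the text splits into three sentences instead of six fragments.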

Owner

askerlee commented May 18, 2017 via email

Author

gabrer commented May 18, 2017

Oh, thank you for confirming this!
I've already modified the regular expression, but unfortunately the problematic periods come not only from abbreviations but also from "mistakes".

Thank you anyway!
