upper case stop words for German texts #110

Open
hennyu opened this issue Jul 5, 2017 · 2 comments

hennyu commented Jul 5, 2017

Hello,
for a German corpus, I tried to apply a custom stop word list containing a mixture of lower case and upper case words. The texts themselves also retain their original case. I used the following import command:

bin/mallet import-dir --input 'corpus' --output 'mallet_out' --keep-sequence --token-regex '\p{L}[\p{L}\p{P}]*\p{L}' --stoplist-file 'stopwords_de.txt' --preserve-case TRUE

As a result, the lower case stop words from my list are removed, but the upper case stop words still appear in the topics. What should I do in order for all stop words to be removed?

mimno (Owner) commented Jul 6, 2017

By default stoplists are not case sensitive, so "The" and "the" will both be removed if "the" is in the stoplist. This is implemented by comparing the toLowerCase() version of the input token to words in the stoplist, regardless of whether we're keeping the original case. You could therefore add "ihre" to the stoplist and it would match "Ihre".
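
To make the matching behaviour described above concrete, here is a minimal, self-contained Java sketch. It is not MALLET's code: the class `StoplistMatchSketch` and the method `removeStopwords` are hypothetical names used only for illustration, and the sketch assumes the only thing that matters is that the token is lower-cased before the lookup while the stoplist entries are used exactly as written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: mimics the lookup described in the comment above,
// it is NOT taken from the MALLET source.
public class StoplistMatchSketch {

    // Hypothetical helper. When caseSensitive is false, the token is
    // lower-cased before the lookup; the stoplist entries stay untouched.
    static List<String> removeStopwords(List<String> tokens,
                                        Set<String> stoplist,
                                        boolean caseSensitive) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String key = caseSensitive ? token : token.toLowerCase(Locale.ROOT);
            if (!stoplist.contains(key)) {
                kept.add(token); // original case is preserved in the output
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Ihre", "Texte", "und", "ihre", "Themen");

        // Stoplist with an upper-case entry: "Ihre" can never equal a lower-cased token.
        Set<String> mixed = new HashSet<>(Arrays.asList("Ihre", "und"));
        // Stoplist with lower-case entries only.
        Set<String> lower = new HashSet<>(Arrays.asList("ihre", "und"));

        System.out.println(removeStopwords(tokens, mixed, false)); // [Ihre, Texte, ihre, Themen]
        System.out.println(removeStopwords(tokens, lower, false)); // [Texte, Themen]
    }
}
```

Under these assumptions, an upper-case stoplist entry such as "Ihre" is simply never hit in the default mode, which would explain why only the lower-case entries had any effect.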

There's an option in the API to create a case-sensitive stoplist if you are ok writing Java.

Adding an option for case-sensitive-stoplist would be a good longer-term solution that's worth considering.

hennyu (Author) commented Jul 6, 2017

Thank you very much for your answer. It explains perfectly why the upper case stop words are still in the topics. I am fine with turning all the stop words into lower case. A new option for a stop word list with upper case words would still be very nice.
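
For anyone with the same problem, turning the stop words into lower case could be done with a small standalone Java program like the sketch below. It assumes the stoplist is a UTF-8 text file with whitespace-separated words; the class name `LowercaseStoplist`, the method `normalize`, and the output file name `stopwords_de_lower.txt` are made up for this example and have nothing to do with MALLET itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: rewrites a stoplist so that every entry is lower case.
public class LowercaseStoplist {

    // Hypothetical helper: lower-cases every word, drops empty entries and
    // duplicates, and keeps the order of first appearance.
    static List<String> normalize(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word.toLowerCase(Locale.GERMAN));
                }
            }
        }
        return new ArrayList<>(words);
    }

    public static void main(String[] args) throws IOException {
        // File names are assumptions; pass your own as arguments.
        Path in = Paths.get(args.length > 0 ? args[0] : "stopwords_de.txt");
        Path out = Paths.get(args.length > 1 ? args[1] : "stopwords_de_lower.txt");

        List<String> lines = Files.readAllLines(in, StandardCharsets.UTF_8);
        Files.write(out, normalize(lines), StandardCharsets.UTF_8); // one word per line
    }
}
```

The resulting file can then be passed to `--stoplist-file` in place of the original list, with `--preserve-case` left as it is.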
