upper case stop words for German texts #110

hennyu · 2017-07-05T14:27:00Z

Hello,
for a corpus in German, I tried to apply a custom stop word list with a mixture of lower case and upper case words. In the texts themselves, upper case is also retained. I used the following import statement:

bin/mallet import-dir --input 'corpus' --output 'mallet_out' --keep-sequence --token-regex '\p{L}[\p{L}\p{P}]*\p{L}' --stoplist-file 'stopwords_de.txt' --preserve-case TRUE

As a result, the lower case stop words from my list are removed, but the upper case stop words still appear in the topics. What should I do in order for all stop words to be removed?

mimno · 2017-07-06T15:37:21Z

By default stoplists are not case sensitive, so "The" and "the" will both be removed if "the" is in the stoplist. This is implemented by comparing the toLowerCase() version of the input token to words in the stoplist, regardless of whether we're keeping the original case. You could therefore add "ihre" to the stoplist and it would match "Ihre".
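To make the matching rule above concrete, here is a minimal sketch in plain Java. It is not MALLET's code: the class `StoplistMatchSketch` and its method are hypothetical names, and `Locale.ROOT` is an assumption chosen for reproducibility. It only mimics the described behaviour, where the token is lower-cased before the lookup while the stoplist entries are used as written.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the stop word check described above; not MALLET code.
final class StoplistMatchSketch {
    private final Set<String> stoplist;

    StoplistMatchSketch(String... entries) {
        // Entries are stored exactly as they appear in the stop word file.
        this.stoplist = new HashSet<>(Arrays.asList(entries));
    }

    // The token is lower-cased before the lookup; the stoplist entries are not.
    boolean isStopword(String token) {
        return stoplist.contains(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        StoplistMatchSketch sketch = new StoplistMatchSketch("ihre", "Der");
        System.out.println(sketch.isStopword("Ihre")); // true: "ihre" is in the list
        System.out.println(sketch.isStopword("Der"));  // false: "der" is not, only "Der"
    }
}
```

Under this rule an upper case entry such as "Der" can never match anything, which would explain why the upper case stop words survive in the topics.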

There's an option in the API to create a case-sensitive stoplist if you are ok writing Java.

Adding an option for case-sensitive-stoplist would be a good longer-term solution that's worth considering.

hennyu · 2017-07-06T18:47:18Z

Thank you very much for your answer. It explains perfectly why the upper case stop words are still in the topics. I am fine with turning all the stop words into lower case. A new option for a stop word list with upper case words would still be very nice.
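For anyone who wants to apply the same workaround, converting an existing stop word list to lower case could look roughly like the sketch below. The class name `LowercaseStoplist`, the one-word-per-line file layout, UTF-8 encoding, and the use of `Locale.GERMAN` are all assumptions for illustration, not something prescribed by MALLET.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: lower-cases a stop word list, assuming one word per line.
final class LowercaseStoplist {

    // Trims and lower-cases each entry, drops blank lines, and removes
    // duplicates that arise once "Der" and "der" collapse into one entry.
    static List<String> lowercase(List<String> lines) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : lines) {
            String word = line.trim().toLowerCase(Locale.GERMAN);
            if (!word.isEmpty()) {
                seen.add(word);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) throws IOException {
        // args[0]: existing stop word file, args[1]: lower-cased output file
        List<String> input = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), lowercase(input), StandardCharsets.UTF_8);
    }
}
```

The output file can then be passed to `--stoplist-file` in place of the mixed-case list, while `--preserve-case TRUE` keeps the original case in the texts themselves.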
