Hello,
For a German corpus, I tried to apply a custom stop word list that mixes lower case and upper case words. Upper case is also retained in the texts themselves. I used the following import command:
bin/mallet import-dir --input 'corpus' --output 'mallet_out' --keep-sequence --token-regex '\p{L}[\p{L}\p{P}]*\p{L}' --stoplist-file 'stopwords_de.txt' --preserve-case TRUE
As a result, the lower case stop words from my list are removed, but the upper case stop words still appear in the topics. What should I do so that all stop words are removed?
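By default stoplists are not case sensitive, so "The" and "the" will both be removed if "the" is in the stoplist. This is implemented by comparing the toLowerCase() version of the input token to the words in the stoplist, regardless of whether we're keeping the original case. The stoplist entries themselves are compared as written, so an entry such as "Ihre" never matches, because the token has already been lowercased to "ihre" before the lookup. You could therefore add "ihre" to the stoplist and it would match "Ihre".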
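To make the lookup described above concrete, here is a minimal sketch in plain Java. It is not MALLET source code: the class `StoplistMatchSketch`, the method `isStopword`, and the `caseSensitive` parameter are hypothetical names chosen for illustration, and the only behavior assumed is the one stated in this thread (the token is lowercased, the stoplist entries are not).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of the matching rule, not MALLET's implementation.
class StoplistMatchSketch {

    // Returns true if the token would be dropped as a stop word.
    static boolean isStopword(Set<String> stoplist, String token, boolean caseSensitive) {
        // In the default (case-insensitive) mode only the token is lowercased;
        // the stoplist entries are compared exactly as they were written.
        String key = caseSensitive ? token : token.toLowerCase(Locale.ROOT);
        return stoplist.contains(key);
    }

    public static void main(String[] args) {
        Set<String> mixedCaseList = new HashSet<>();
        mixedCaseList.add("Ihre");   // upper case entry, as in the original list
        mixedCaseList.add("und");

        Set<String> lowerCaseList = new HashSet<>();
        lowerCaseList.add("ihre");   // same word, lowercased
        lowerCaseList.add("und");

        // "Ihre" is lowercased to "ihre", which is not in the mixed-case list.
        System.out.println(isStopword(mixedCaseList, "Ihre", false));
        // With a lowercased list the same token is matched.
        System.out.println(isStopword(lowerCaseList, "Ihre", false));
    }
}
```

Under these assumptions, the mixed-case list misses "Ihre" in the default mode, while the lowercased list catches both "Ihre" and "ihre".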
There's an option in the API to create a case-sensitive stoplist if you are ok writing Java.
Adding a command-line option for a case-sensitive stoplist would be a good longer-term solution that's worth considering.
Thank you very much for your answer. It explains perfectly why the upper case stop words are still in the topics. I am fine with turning all the stop words into lower case. A new option for stop word lists that contain upper case words would still be very nice.
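For anyone who wants to apply the same workaround, lowercasing a stop word list can be sketched in plain Java as follows. This is only one possible approach: the class `StoplistNormalizer`, its method `toLowerCaseEntries`, and the input and output paths passed as arguments are hypothetical, and the sketch assumes a UTF-8 file with one stop word per line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: rewrites a stop word list so every entry is lower case.
class StoplistNormalizer {

    // Trims and lowercases each entry, drops blank lines, and removes
    // duplicates while keeping the original order of first occurrence.
    static List<String> toLowerCaseEntries(List<String> entries) {
        Set<String> seen = new LinkedHashSet<>();
        for (String entry : entries) {
            String word = entry.trim().toLowerCase(Locale.GERMAN);
            if (!word.isEmpty()) {
                seen.add(word);
            }
        }
        return new ArrayList<>(seen);
    }

    // Usage (paths are placeholders): java StoplistNormalizer in.txt out.txt
    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        Path output = Paths.get(args[1]);
        List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
        Files.write(output, toLowerCaseEntries(lines), StandardCharsets.UTF_8);
    }
}
```

The resulting file can then be passed to --stoplist-file in place of the mixed-case list.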