build_corpus always removes words having a frequency below 5 #67
Comments
Maybe the simplest solution would be to tokenise all of the words mentioned across the DocumentSet and print them to a .csv, from which we can refine for mentions of geographical locations?
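A minimal sketch of that idea, assuming the DocumentSet can be iterated directly and that each Document exposes title and abstract attributes (the regex tokeniser and the output filename here are illustrative, not part of litstudy):

import re
from collections import Counter
import pandas as pd

# docs_springer is the DocumentSet built earlier in this thread
counts = Counter()
for doc in docs_springer:
    # title or abstract may be missing for some records, so skip empty fields
    text = " ".join(filter(None, [doc.title, doc.abstract]))
    counts.update(re.findall(r"[a-z]+", text.lower()))

# one row per word, highest counts first, for manual inspection
pd.Series(counts, name="count").sort_values(ascending=False).to_csv("word_counts.csv")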
You can use the min_docs option for this. You can change it by passing min_docs=1, which means a word is valid if it appears in at least one document (which is always the case).
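For example, keeping the call from the issue description, the change would just add the keyword (a sketch; it assumes min_docs can be combined with ngram_threshold):

corpus = litstudy.build_corpus(docs_springer, ngram_threshold=0.8, min_docs=1)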
Thanks for this.
Unless I'm missing something, the min_docs=x option doesn't seem to change anything about the output; see below, where the same output is given for both min_docs=1 and min_docs=10.
[output omitted in the page capture: the same word distribution is shown for both settings]
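A rough way to check this programmatically (a sketch; it assumes litstudy.compute_word_distribution can be used to inspect the resulting vocabulary, which may differ from how the output above was produced):

for n in (1, 10):
    corpus = litstudy.build_corpus(docs_springer, ngram_threshold=0.8, min_docs=n)
    dist = litstudy.compute_word_distribution(corpus)
    # if min_docs were honoured, min_docs=1 should keep more words than min_docs=10
    print(f"min_docs={n}: {len(dist)} words in the distribution")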
Thanks,
S
This looks like a bug. I'll need to look into this. Thanks for reporting this!
In this example, we are looking for mentions of countries, regions or locations on the basis of Abstract and Author and Index Keywords. For this, we are using
corpus = litstudy.build_corpus(docs_springer, ngram_threshold=0.8)
The ngram threshold, even at its lowest possible value (0.1), returns a list of common words found in the abstracts of these papers. However, this frequency never goes below 5 mentions, meaning that references to a number of countries are excluded from the word distribution.
Is there a way to reduce the ngram threshold further, or some other method, so that we can capture all word mentions, that is, words with a count of 1 or greater? From this we can then see which refer to geographical areas, and use the filter(like='_', axis=0) function to find relevant bigrams (e.g. United States).
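For what it's worth, the follow-up step would look roughly like this (a sketch; it assumes litstudy.compute_word_distribution returns a pandas object indexed by word, and the list of place names is purely illustrative):

corpus = litstudy.build_corpus(docs_springer, ngram_threshold=0.8)
dist = litstudy.compute_word_distribution(corpus)

# bigrams produced by build_corpus are joined with an underscore, e.g. united_states
bigrams = dist.filter(like='_', axis=0)

# crude geographic filter: keep entries matching a hand-made list of place names
places = {'united_states', 'india', 'china', 'brazil'}
print(bigrams)
print(dist[dist.index.isin(places)])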
Thanks,
S