build_corpus always removes words having a frequency below 5 #67
Comments
Maybe the simplest solution would be to tokenise all of the words mentioned across the DocumentSet and print them to a .csv, from which we can refine for mentions of geographical locations?
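A minimal sketch of that idea, assuming the DocumentSet can be iterated directly and that each Document exposes title and abstract attributes (the regex tokeniser and the output filename here are illustrative, not part of litstudy):

import re
from collections import Counter
import pandas as pd

# docs_springer is the DocumentSet built earlier in this thread
counts = Counter()
for doc in docs_springer:
    # title or abstract may be missing for some records, so skip empty fields
    text = " ".join(filter(None, [doc.title, doc.abstract]))
    counts.update(re.findall(r"[a-z]+", text.lower()))

# one row per word, highest counts first, for manual inspection
pd.Series(counts, name="count").sort_values(ascending=False).to_csv("word_counts.csv")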
You can use the min_docs option for this. You can change it by passing min_docs=1, which means a word is valid if it appears in at least one document (which is always the case).
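For example, keeping the call from the issue description, the change would just add the keyword (a sketch; it assumes min_docs can be combined with ngram_threshold):

corpus = litstudy.build_corpus(docs_springer, ngram_threshold=0.8, min_docs=1)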
Thanks for this.
Unless I'm missing something, the min_docs=x option doesn't seem to change anything about the output; see below, where the same output is given for both min_docs=1 and min_docs=10.
[output omitted in the page capture: the same word distribution is shown for both settings]
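A rough way to check this programmatically (a sketch; it assumes litstudy.compute_word_distribution can be used to inspect the resulting vocabulary, which may differ from how the output above was produced):

for n in (1, 10):
    corpus = litstudy.build_corpus(docs_springer, ngram_threshold=0.8, min_docs=n)
    dist = litstudy.compute_word_distribution(corpus)
    # if min_docs were honoured, min_docs=1 should keep more words than min_docs=10
    print(f"min_docs={n}: {len(dist)} words in the distribution")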
Thanks,
S
This looks like a bug. I'll need to look into this. Thanks for reporting this!
In this example, we are looking for mentions of countries, regions or locations on the basis of Abstract and Author and Index Keywords. For this, we are using
corpus = litstudy.build_corpus(docs_springer, ngram_threshold=0.8)
The ngram threshold, even at its lowest possible value (0.1), returns a list of common words found in the abstracts of these papers. However, this frequency never goes below 5 mentions, meaning that references to a number of countries are excluded from the word distribution.
Is there a way to reduce the ngram threshold further, or some other method, so that we can capture all word mentions, that is, words with a count of 1 or greater? From this we can then see which refer to geographical areas, and use the filter(like='_', axis=0) function to find relevant bigrams (e.g. United States).
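For what it's worth, the follow-up step would look roughly like this (a sketch; it assumes litstudy.compute_word_distribution returns a pandas object indexed by word, and the list of place names is purely illustrative):

corpus = litstudy.build_corpus(docs_springer, ngram_threshold=0.8)
dist = litstudy.compute_word_distribution(corpus)

# bigrams produced by build_corpus are joined with an underscore, e.g. united_states
bigrams = dist.filter(like='_', axis=0)

# crude geographic filter: keep entries matching a hand-made list of place names
places = {'united_states', 'india', 'china', 'brazil'}
print(bigrams)
print(dist[dist.index.isin(places)])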
Thanks,
S