GitHub - JudicaelPoumay/news-stopwords: A huge list of stopwords collected from millions of news articles

About

The file sw1k.csv contains most frequent 1,000 words and phrases occurred in more than 1.53 million news articles from 200+ sources collected over the span of a 2.5 years.

Similarily sw10k.csv and sw100k.csv contain 10,000 and 100,000 terms respectively.

Structure of the files

Each CSV file contains 5 columns:

term: the actual word or phrase

frequency: how many times a term has occured in all the documents

presence: in how many documents the term has occurred, note that frequence >= presence

doc_size_sum: sum of the size of the documents in which the term has occurred,

doc_size(d) = number of characters present in the doc d including whitespace

type: type of the term, N: Noun, NP: Noun Phrase, PN: Proper Noun, G: Other

types PERSON, PLACE, ORGANIZATION are self explanatory

Libs Sanford CoreNLP and OpenNLP were used to split the docs in sentences, NER and POS tagging

Notes

The dataset is not manually processed, so you might see some unusual terms such as some common emails, news site names, editor/author names etc. These can be easily filtered by keeping the words with higher frequency/presence ratio.
The dataset includes all type of frequent terms like proper nouns, places, orgs etc. One can filter these using the type column.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
README.md		README.md
sw100k.csv		sw100k.csv
sw10k.csv		sw10k.csv
sw1k.csv		sw1k.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Structure of the files

Notes

About

Releases

Packages

JudicaelPoumay/news-stopwords

Folders and files

Latest commit

History

Repository files navigation

About

Structure of the files

Notes

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages