babyakja/subreddit_nlp_tx_senate_race

NLP on the 2018 Texas Race using the Reddit API

Objective

Using the tools learned in General Assembly's Data Science Immersive course, we will collect data via an API request and then build a binary classification predictor. In particular, we will apply this to Reddit. Using NLP, which characteristics of a Reddit post contribute most to determining the subreddit it belongs to? Our data will focus on the Texas Senate race between Ted Cruz (R) and Beto O'Rourke (D).

Python Packages Used:

  • Pandas
  • Numpy
  • Scikit-Learn

Data Collection

Build a function that scrapes Reddit, taking a subreddit name and the number of pages to pull as inputs. Reddit's API is limited to the most recent 1,000 posts for any subreddit, so how active a subreddit is determines how often pages need to be pulled.
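A minimal sketch of such a function is shown here; it is not the repository's actual code. It assumes Reddit's public JSON listing endpoint (`https://www.reddit.com/r/<subreddit>.json`) and its `after` pagination token. The names `fetch_json` and `scrape_subreddit`, the User-Agent string, and the `pause` parameter are hypothetical, and only the standard library is used.

```python
import json
import time
import urllib.parse
import urllib.request


def fetch_json(url):
    """Download a URL and decode its JSON body."""
    # A custom User-Agent is set because default agents are often throttled.
    request = urllib.request.Request(
        url, headers={"User-Agent": "subreddit-nlp-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def scrape_subreddit(subreddit, pages, fetch=fetch_json, pause=1.0):
    """Collect post dicts from up to `pages` listing pages of a subreddit."""
    posts = []
    after = None  # pagination token returned by the previous page
    for _ in range(pages):
        url = f"https://www.reddit.com/r/{subreddit}.json"
        if after:
            url += "?" + urllib.parse.urlencode({"after": after})
        listing = fetch(url)["data"]
        posts.extend(child["data"] for child in listing["children"])
        after = listing.get("after")
        if after is None:  # no further pages are available
            break
        time.sleep(pause)  # be polite between requests
    return posts
```

Passing `fetch` as a parameter keeps the pagination logic testable without network access; a call such as `scrape_subreddit("texas", pages=10)` would use the real endpoint.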

Clean the data into a useful corpus and label each post with its source subreddit. This creates domain-specific training data that can then be used to classify new data.
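The cleaning and labeling step could look roughly like the sketch below. It assumes each scraped post is a dict with `title` and `selftext` keys (as in Reddit's JSON listings); the function name `build_corpus` and the choice to drop empty and duplicate texts are illustrative assumptions, not the project's confirmed procedure.

```python
import pandas as pd


def build_corpus(posts_by_subreddit):
    """Turn {subreddit_name: [post dicts]} into a labeled text DataFrame."""
    rows = []
    for name, posts in posts_by_subreddit.items():
        for post in posts:
            # Combine title and body; missing or None fields become "".
            title = post.get("title") or ""
            body = post.get("selftext") or ""
            rows.append({"text": f"{title} {body}".strip(), "subreddit": name})
    corpus = pd.DataFrame(rows, columns=["text", "subreddit"])
    # Drop posts with no text and posts pulled more than once.
    corpus = corpus[corpus["text"] != ""].drop_duplicates(subset="text")
    return corpus.reset_index(drop=True)
```

The `subreddit` column then serves as the target for the binary classifier, and `text` as its input.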

Modeling

Classification

  • Use GridSearch and Pipelines to find the best predictive model: run through different model types and grid-search their hyperparameters to tune for the best performance
  • Metric Targeted: Accuracy
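As an illustration of the approach in the list above, the sketch below wires a text vectorizer and a classifier into a scikit-learn `Pipeline` and searches over two model types with `GridSearchCV`, scoring on accuracy. The specific vectorizer, estimators, and grid values are assumptions chosen for the example; the source does not say which models or hyperparameters were actually tried.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def tune_classifier(texts, labels, cv=5):
    """Grid-search a vectorizer + classifier pipeline, optimizing accuracy."""
    pipe = Pipeline([
        ("vec", TfidfVectorizer(stop_words="english")),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Each dict is one sub-grid; the "clf" entry swaps the model type.
    param_grid = [
        {
            "vec__ngram_range": [(1, 1), (1, 2)],
            "clf": [LogisticRegression(max_iter=1000)],
            "clf__C": [0.1, 1.0, 10.0],
        },
        {
            "vec__ngram_range": [(1, 1), (1, 2)],
            "clf": [MultinomialNB()],
            "clf__alpha": [0.1, 1.0],
        },
    ]
    search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=cv)
    search.fit(texts, labels)
    return search
```

After fitting, `search.best_params_` reports the winning combination and `search.best_score_` its mean cross-validated accuracy; the search object itself can be used for prediction because it refits the best pipeline on all of the data.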

Findings

We can apply our model to a different subreddit to gauge the pulse of discussion in untagged posts. We will use r/Texas posts flagged as political to see which candidate this 'neutral' space is discussing more.

After feeding the r/Texas data into the model, 69% of posts were classified as geared toward Beto and 31% as geared toward Cruz. At first glance this suggests Beto has the advantage of being the center of discussion, but the sentiment of each post needs to be factored in as well.
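The split reported above is the share of predicted labels over the unlabeled posts. A minimal sketch of that calculation follows, assuming `model` is any fitted classifier with a `predict` method (for example a fitted scikit-learn pipeline); the function name `predicted_share` is hypothetical.

```python
import pandas as pd


def predicted_share(model, texts):
    """Return the fraction of texts assigned to each predicted label."""
    predictions = pd.Series(model.predict(list(texts)))
    # normalize=True converts raw counts into proportions that sum to 1.
    return predictions.value_counts(normalize=True)
```

Note that this measures only which candidate a post resembles, not whether the post is favorable or unfavorable toward that candidate.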

Conclusion

Although our model reached an accuracy of 94%, the posts it classified with high probability were heavily negative in tone toward both candidates.

About

NLP project on the TX Senate Race between Beto and Cruz
